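I. Student Exam Data Analysis

1. Background

This dataset is simulated. It contains the scores of public school students on three exams, together with a range of personal and socioeconomic factors that may interact with those scores.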

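2. Data description

gender: the student's gender
race/ethnicity: the student's racial or ethnic group
parental level of education: the parents' level of education
lunch: lunch type
test preparation course: whether the test preparation course was completed
math score: score on the math exam
reading score: score on the reading exam
writing score: score on the writing exam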

3. Data source

roycekimmons.com/tools/gener…

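II. Data Analysis

1. Reading the data

Common encodings include utf-8, gbk, and gb2312.

The usual choice is encoding='utf-8'.
quotechar sets the quoting character (not an escape character) that wraps fields containing the delimiter; for double quotes, use quotechar='"'.

Reading this dataset as utf-8 raises an error.

A decoding failure looks like the following (this particular message names the gb2312 codec):
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xa9 in position 8221: illegal multibyte sequence

For this dataset, encoding='gb2312' works.
If the file contains Traditional Chinese characters, read it with gbk instead, which is a superset of gb2312.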

import pandas as pd
data=pd.read_csv("data/data222411/Students_Exam_Scores.csv", encoding='gb2312')
data.head()


	序号	性别	分组	父母教育背景	午餐类型	完成备考课程	父母婚恋状态	参与运动的频率	是否是第一个孩子	兄弟姐妹数量	上学交通工具	每周自习时间	数学成绩	阅读成绩	写作成绩
0	0	女	NaN	学士学位	标准	没有完成	已婚	经常	是	3.0	校车	小于5	71	71	74
1	1	女	C组	高中肄业	标准	NaN	已婚	有时	是	0.0	NaN	5到10	69	90	88
2	2	女	B组	硕士学位	标准	没有完成	单身	有时	是	4.0	校车	小于5	87	93	91
3	3	男	A组	双学位学士	免费/降价	没有完成	已婚	从不	否	1.0	NaN	5到10	45	56	42
4	4	男	C组	高中肄业	标准	没有完成	已婚	有时	是	0.0	校车	5到10	76	78	75
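The trial-and-error over encodings described above can be automated. The following is a minimal sketch, not the author's method: the helper name read_csv_with_fallback, the order of candidate encodings, and the throwaway demo file are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gb2312", "gbk")):
    """Hypothetical helper: try each encoding in turn, return (DataFrame, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next codec
    raise last_error


# Demo on a throwaway file written in gbk (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="gbk") as f:
        f.write("性别,数学成绩\n女,71\n男,45\n")
    demo_df, used = read_csv_with_fallback(demo_path)

print(used)
print(demo_df)
```

Placing utf-8 first keeps the common case fast, and placing gbk last makes it the most permissive fallback among the three.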

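2. Missing value analysis

pandas provides a convenient isnull() method that flags missing values. It returns a Boolean mask of the same shape as the data, with True marking each missing entry.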

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10,99,size=(10,5)))
df.iloc[4:6,0] = np.nan
df.iloc[5:7,2] = np.nan
df.iloc[7,3] = np.nan
df.iloc[2:3,4] = np.nan
df

	0	1	2	3	4
0	15.0	29	13.0	94.0	58.0
1	84.0	56	94.0	27.0	50.0
2	51.0	18	71.0	53.0	NaN
3	58.0	60	29.0	51.0	90.0
4	NaN	23	26.0	12.0	82.0
5	NaN	30	NaN	66.0	57.0
6	89.0	90	NaN	32.0	62.0
7	25.0	53	48.0	NaN	40.0
8	77.0	31	54.0	89.0	52.0
9	57.0	54	55.0	40.0	36.0

df.isnull()

	0	1	2	3	4
0	False	False	False	False	False
1	False	False	False	False	False
2	False	False	False	False	True
3	False	False	False	False	False
4	True	False	False	False	False
5	True	False	True	False	False
6	False	False	True	False	False
7	False	False	False	True	False
8	False	False	False	False	False
9	False	False	False	False	False

2.1 Counting missing values: isnull().sum()

isnull().sum() counts the missing values in each column.

df.isnull().sum()

0    2
1    0
2    2
3    1
4    1
dtype: int64

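The result lists the number of missing values for every column.

2.2 Detecting missing values: isnull().any()

isnull().any() reports which columns contain at least one missing value.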

df.isnull().any()

0     True
1    False
2     True
3     True
4     True
dtype: bool

2.3 Missing values in the exam dataset

# Check whether the data contains missing values
data.isnull().sum()

序号             0
性别             0
分组          1840
父母教育背景      1845
午餐类型           0
完成备考课程      1830
父母婚恋状态      1190
参与运动的频率      631
是否是第一个孩子     904
兄弟姐妹数量      1572
上学交通工具      3134
每周自习时间       955
数学成绩           0
阅读成绩           0
写作成绩           0
dtype: int64

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   序号        30641 non-null  int64  
 1   性别        30641 non-null  object 
 2   分组        28801 non-null  object 
 3   父母教育背景    28796 non-null  object 
 4   午餐类型      30641 non-null  object 
 5   完成备考课程    28811 non-null  object 
 6   父母婚恋状态    29451 non-null  object 
 7   参与运动的频率   30010 non-null  object 
 8   是否是第一个孩子  29737 non-null  object 
 9   兄弟姐妹数量    29069 non-null  float64
 10  上学交通工具    27507 non-null  object 
 11  每周自习时间    29686 non-null  object 
 12  数学成绩      30641 non-null  int64  
 13  阅读成绩      30641 non-null  int64  
 14  写作成绩      30641 non-null  int64  
dtypes: float64(1), int64(4), object(10)
memory usage: 3.5+ MB
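Raw counts are easier to judge next to the share of rows they represent. The sketch below, on a small made-up frame (the column names a, b, c and the summary column names are assumptions), combines isnull().sum() with isnull().mean() to show both.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the exam dataset
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, None],
    "c": [1, 2, 3, 4],
})

# The mean of a Boolean mask is the fraction of True values, i.e. the missing ratio
missing = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_ratio": demo.isnull().mean(),
}).sort_values("missing_count", ascending=False)

print(missing)
```

Applied to the exam data, the same pattern would show at a glance which columns lose the most rows if missing values are dropped.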

3. Gender analysis

3.1 Random sampling with sample()

Given a DataFrame with N rows, random sampling draws X rows from it at random.
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Parameters

n : int, optional

Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

frac : float, optional

Fraction of axis items to return. Cannot be used with n.

replace : bool, default False

Allow or disallow sampling of the same row more than once.

weights : str or ndarray-like, optional

Default ‘None’ results in equal probability weighting. If passed a Series, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DataFrame, will accept the name of a column when axis = 0. Unless weights are a Series, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.

random_state : int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional

If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.

Changed in version 1.1.0: array-like and BitGenerator object now passed to np.random.RandomState() as seed

Changed in version 1.4.0: np.random.Generator objects now accepted

axis : {0 or 'index', 1 or 'columns', None}, default None

Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For Series this parameter is unused and defaults to None.

ignore_index : bool, default False

If True, the resulting index will be labeled 0, 1, …, n - 1.

New in version 1.3.0.

Returns

Series or DataFrame

A new object of same type as caller containing n items randomly sampled from the caller object.

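Summary of the main parameters:

n: the number of rows to return. Cannot be combined with frac; defaults to 1 when frac is None.
frac: the fraction of rows to return. Cannot be combined with n.

Reference: pandas.pydata.org/pandas-docs…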

data.sample(10)

	序号	性别	分组	父母教育背景	午餐类型	完成备考课程	父母婚恋状态	参与运动的频率	是否是第一个孩子	兄弟姐妹数量	上学交通工具	每周自习时间	数学成绩	阅读成绩	写作成绩
11322	1	女	B组	高中肄业	标准	没有完成	单身	经常	否	3.0	校车	小于5	58	64	68
20074	359	男	D组	NaN	免费/降价	完成	已婚	有时	否	3.0	校车	5到10	69	77	74
16386	424	男	C组	高中	标准	NaN	已婚	经常	是	3.0	校车	5到10	61	48	49
9635	195	男	C组	学士学位	标准	没有完成	已婚	有时	否	3.0	校车	5到10	80	65	70
24876	510	男	D组	NaN	免费/降价	完成	已婚	有时	是	1.0	校车	5到10	63	58	60
6515	873	女	A组	高中肄业	免费/降价	没有完成	已婚	有时	是	2.0	私家车	5到10	53	65	58
12718	493	女	NaN	硕士学位	标准	没有完成	NaN	从不	是	0.0	私家车	5到10	49	48	48
28377	274	男	B组	高中肄业	免费/降价	完成	已婚	有时	是	4.0	校车	5到10	51	51	58
17820	952	男	A组	高中	标准	没有完成	已婚	经常	否	4.0	校车	5到10	62	48	48
22666	148	男	C组	学士学位	免费/降价	没有完成	丧偶	经常	否	1.0	NaN	5到10	64	64	61
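The parameters described above can be compared side by side. This is a small sketch on a made-up 100-row frame (the id column is an assumption); it shows sampling by count, sampling by fraction, and how random_state makes a draw repeatable.

```python
import pandas as pd

# Illustrative frame with 100 rows, not the exam dataset
demo = pd.DataFrame({"id": range(100)})

by_count = demo.sample(n=10, random_state=0)    # exactly 10 rows
by_frac = demo.sample(frac=0.1, random_state=0)  # 10% of the rows
again = demo.sample(n=10, random_state=0)        # same seed, same rows

print(by_count.index.tolist())
print(by_count.equals(again))
```

Without random_state, each call returns a different selection, which is why the sampled rows shown above will not match a rerun.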

3.2 Counting with value_counts()

value_counts() counts the distinct values in the specified column(s) and how many times each one occurs, and can return the result sorted according to its parameters. It returns a Series.

DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)

Parameters:

subset : label or list of labels, optional

Columns to use when counting unique combinations.

normalize : bool, default False

Return proportions rather than frequencies.

sort : bool, default True

Sort by frequencies.

ascending : bool, default False

Sort in ascending order.

dropna : bool, default True

Don’t include counts of rows that contain NA values.

Summary:

subset: the column label(s) to count over
normalize: return proportions instead of counts
dropna: whether to exclude rows that contain NA values
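To tie this back to the gender analysis, here is a minimal sketch of value_counts() on a gender column. It uses a tiny made-up frame rather than the real data, and it calls Series.value_counts on a single column, which accepts the same normalize and dropna arguments.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the real column is data['性别']
demo = pd.DataFrame({"性别": ["女", "男", "女", "女", np.nan, "男"]})

counts = demo["性别"].value_counts()                 # absolute counts, NaN excluded
shares = demo["性别"].value_counts(normalize=True)    # proportions among non-missing rows
with_na = demo["性别"].value_counts(dropna=False)     # NaN counted as its own category

print(counts)
print(shares)
print(with_na)
```

On the exam data the equivalent call would be data['性别'].value_counts(), assuming the column names shown in the earlier output.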

4. Sorting by total score with sort_values

DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Parameters

by : str or list of str

Name or list of names to sort by.

if axis is 0 or ‘index’ then by may contain index levels and/or column labels.
if axis is 1 or ‘columns’ then by may contain column levels and/or index labels.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Axis to be sorted.

ascending : bool or list of bool, default True

Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.

inplace : bool, default False

If True, perform operation in-place.

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’

Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.

na_position : {‘first’, ‘last’}, default ‘last’

Puts NaNs at the beginning if first; last puts NaNs at the end.

ignore_index : bool, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

key : callable, optional

Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.

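Summary:

by: the column(s) to sort by
ascending: whether to sort in ascending order
inplace: whether to sort the DataFrame in place instead of returning a new one
kind: the sorting algorithm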

pandas.pydata.org/pandas-docs…

data['total_score'] = data['数学成绩'] + data['阅读成绩'] + data['写作成绩']

# Sort by total score, highest first
sorted_df = data.sort_values(by='total_score', ascending=False)
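As an illustration of the by, ascending, and ignore_index parameters together, the sketch below sorts a four-row sample (scores taken from the head() output earlier) by total score and breaks ties on the math score. The tie-breaking column is an assumption chosen for the example.

```python
import pandas as pd

# Four rows of scores copied from the preview of the dataset
demo = pd.DataFrame({
    "数学成绩": [71, 69, 87, 45],
    "阅读成绩": [71, 90, 93, 56],
    "写作成绩": [74, 88, 91, 42],
})
demo["total_score"] = demo[["数学成绩", "阅读成绩", "写作成绩"]].sum(axis=1)

# Sort by total score, then by math score, both descending; renumber the rows
top = demo.sort_values(
    by=["total_score", "数学成绩"],
    ascending=[False, False],
    ignore_index=True,
).head(3)

print(top)
```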

5. Dropping rows with missing values

Remove missing values.

DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)

Parameters

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Determine if rows or columns which contain missing values are removed.

0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing value.

Only a single axis is allowed; passing a tuple or list to drop on multiple axes is no longer supported.

how : {‘any’, ‘all’}, default ‘any’

Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.

thresh : int, optional

Require that many non-NA values. Cannot be combined with how.

subset : column label or sequence of labels, optional

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplace : bool, default False

Whether to modify the DataFrame rather than creating a new one.

ignore_index : bool, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

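Summary:

axis: the axis to drop along, 0 for rows and 1 for columns
how: 'any' drops a row or column that contains any NA; 'all' drops it only when every value is NA
inplace: whether to modify the original DataFrame

# Reassign the result instead of modifying in place; the alternative is data.dropna(inplace=True)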
data=data.dropna()
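Calling dropna() with its defaults removes every row that has a missing value in any column, which can discard a large part of a dataset with many partially filled columns. The sketch below compares the how, subset, and thresh options on a small made-up frame (the columns a, b, c are assumptions) so the difference in retained rows is visible.

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the exam dataset
demo = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [np.nan, np.nan, 6, 7],
    "c": [1, 2, 3, 4],
})

any_rows = demo.dropna()                  # keep rows with no NA at all
all_rows = demo.dropna(how="all")         # drop only rows where every value is NA
subset_rows = demo.dropna(subset=["a"])   # look for NA in column a only
thresh_rows = demo.dropna(thresh=2)       # keep rows with at least 2 non-NA values

print(len(any_rows), len(all_rows), len(subset_rows), len(thresh_rows))
```

Comparing len(data) before and after the drop is a quick way to check how many rows a given option costs.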

III. Which factors (features) affect exam scores the most?

from matplotlib import pyplot as plt
import seaborn
%matplotlib inline

# Direct replacement is the simplest approach, but it becomes tedious for columns with many distinct values
# data['性别'].replace({"男":1,"女":0},inplace=True)
# df.LUNG_CANCER.replace({"YES":1,"NO":0},inplace=True)

1. Label encoding with LabelEncoder

`fit`(y)	Fit label encoder.
`fit_transform`(y)	Fit label encoder and return encoded labels.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(y)	Transform labels back to original encoding.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(y)	Transform labels to normalized encoding.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

columns=data.columns[1:]
print(len(columns))
for column in columns:
    print(column)

15
性别
分组
父母教育背景
午餐类型
完成备考课程
父母婚恋状态
参与运动的频率
是否是第一个孩子
兄弟姐妹数量
上学交通工具
每周自习时间
数学成绩
阅读成绩
写作成绩
total_score

for column in columns:
    print(f"Encoded column {column}")
    data[column]=le.fit_transform(data[column])

Encoded column 性别
Encoded column 分组
Encoded column 父母教育背景
Encoded column 午餐类型
Encoded column 完成备考课程
Encoded column 父母婚恋状态
Encoded column 参与运动的频率
Encoded column 是否是第一个孩子
Encoded column 兄弟姐妹数量
Encoded column 上学交通工具
Encoded column 每周自习时间
Encoded column 数学成绩
Encoded column 阅读成绩
Encoded column 写作成绩
Encoded column total_score

Reference: scikit-learn.org/stable/modu…
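The loop above encodes every column, including the numeric scores, which replaces each score with the rank of its value among the distinct values seen. One possible variation, sketched here on a made-up three-row frame, encodes only the non-numeric columns and keeps one fitted encoder per column so the codes can be mapped back. The encoders dictionary is an assumption introduced for the example. scikit-learn documents LabelEncoder as intended for target labels and provides OrdinalEncoder for features; LabelEncoder is kept here only to stay close to the article's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame, not the exam dataset
demo = pd.DataFrame({
    "性别": ["女", "男", "女"],
    "午餐类型": ["标准", "免费/降价", "标准"],
    "数学成绩": [71, 45, 87],
})

encoders = {}  # hypothetical store: one fitted encoder per encoded column
text_columns = [c for c in demo.columns if not pd.api.types.is_numeric_dtype(demo[c])]
for column in text_columns:
    enc = LabelEncoder()
    demo[column] = enc.fit_transform(demo[column])
    encoders[column] = enc

print(demo)
# Recover the original labels from the integer codes
print(list(encoders["性别"].inverse_transform(demo["性别"])))
```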

2. Correlation analysis

DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)

Variables in the data can be related and influence one another.

The correlation coefficient measures how strong that relationship is:

Its range is [-1, 1].
The closer its absolute value is to 0, the weaker the correlation.
The closer its absolute value is to 1, the stronger the correlation.
A value below 0 indicates negative correlation.
A value above 0 indicates positive correlation.

Parameters:

method : {‘pearson’, ‘kendall’, ‘spearman’} or callable

Method of correlation:

pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

numeric_only : bool, default False

Include only float, int or boolean data.

Summary:

method: which correlation method to use.

Reference:

pandas.pydata.org/pandas-docs…

from matplotlib import pyplot as plt
import seaborn
%matplotlib inline


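# Compute the correlation coefficients
data.drop(['序号','数学成绩','阅读成绩','写作成绩'], axis=1, inplace=True)
cor = data.corr()
cor = abs(cor)  # keep only the strength of each correlation, not its sign
print(cor)
plt.figure(figsize=(20,20))
# Draw the heatmap; absolute correlations lie in [0, 1], so the color scale is bounded accordingly
seaborn.heatmap(cor, vmin=0, vmax=1)

# seaborn draws on matplotlib axes, so the title and axis labels are set through pyplot
plt.title("heatmap")
plt.xlabel("x_ticks")
plt.ylabel("y_ticks")
plt.show()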

                   性别        分组    父母教育背景      午餐类型    完成备考课程    父母婚恋状态  \
性别           1.000000  0.000657  0.001845  0.000329  0.008502  0.002671   
分组           0.000657  1.000000  0.001485  0.004800  0.004264  0.003106   
父母教育背景       0.001845  0.001485  1.000000  0.011071  0.001252  0.009902   
午餐类型         0.000329  0.004800  0.011071  1.000000  0.000008  0.012473   
完成备考课程       0.008502  0.004264  0.001252  0.000008  1.000000  0.002443   
父母婚恋状态       0.002671  0.003106  0.009902  0.012473  0.002443  1.000000   
参与运动的频率      0.009384  0.002376  0.007822  0.010888  0.002585  0.000079   
是否是第一个孩子     0.005551  0.000088  0.002527  0.005265  0.002498  0.008034   
兄弟姐妹数量       0.001414  0.001410  0.001928  0.000482  0.002697  0.009180   
上学交通工具       0.009915  0.010716  0.000292  0.000647  0.017227  0.002100   
每周自习时间       0.001296  0.000836  0.007309  0.005530  0.002676  0.004833   
total_score  0.131516  0.189610  0.152230  0.320430  0.227731  0.003902   

              参与运动的频率  是否是第一个孩子    兄弟姐妹数量    上学交通工具    每周自习时间  total_score  
性别           0.009384  0.005551  0.001414  0.009915  0.001296     0.131516  
分组           0.002376  0.000088  0.001410  0.010716  0.000836     0.189610  
父母教育背景       0.007822  0.002527  0.001928  0.000292  0.007309     0.152230  
午餐类型         0.010888  0.005265  0.000482  0.000647  0.005530     0.320430  
完成备考课程       0.002585  0.002498  0.002697  0.017227  0.002676     0.227731  
父母婚恋状态       0.000079  0.008034  0.009180  0.002100  0.004833     0.003902  
参与运动的频率      1.000000  0.010253  0.014042  0.002407  0.007259     0.050950  
是否是第一个孩子     0.010253  1.000000  0.120100  0.005665  0.001562     0.006096  
兄弟姐妹数量       0.014042  0.120100  1.000000  0.008960  0.000671     0.003232  
上学交通工具       0.002407  0.005665  0.008960  1.000000  0.010702     0.002832  
每周自习时间       0.007259  0.001562  0.000671  0.010702  1.000000     0.043298  
total_score  0.050950  0.006096  0.003232  0.002832  0.043298     1.000000

print(cor['total_score'])

性别             0.131516
分组             0.189610
父母教育背景         0.152230
午餐类型           0.320430
完成备考课程         0.227731
父母婚恋状态         0.003902
参与运动的频率        0.050950
是否是第一个孩子       0.006096
兄弟姐妹数量         0.003232
上学交通工具         0.002832
每周自习时间         0.043298
total_score    1.000000
Name: total_score, dtype: float64
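Reading the ranking off the printed column is easier once the target's self-correlation is dropped and the rest is sorted. The sketch below shows that pattern on synthetic data; the column names strong, noise, and target are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'target' depends on 'strong' and is unrelated to 'noise'
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "strong": x,
    "noise": rng.normal(size=200),
    "target": 2 * x + rng.normal(scale=0.1, size=200),
})

# Absolute Pearson correlation with the target, strongest first
ranking = (
    demo.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)

print(ranking)
```

For the exam data the equivalent expression would be cor['total_score'].drop('total_score').sort_values(ascending=False), since cor already holds absolute values.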

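3. Conclusion

午餐类型 (lunch type, 0.32), 完成备考课程 (test preparation course, 0.23), 分组 (group, 0.19), and 父母教育背景 (parental education, 0.15) show the clearest correlation with the total score, listed from strongest to weakest. 性别 (gender) follows at 0.13, and every other feature is below 0.06. These coefficients indicate a weak to moderate association rather than a strong one.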

IV. Building a student score prediction model

1. Constructing X and y

# Build the feature matrix X and the target y
X = data.drop(columns=["total_score"])
y = data["total_score"]

2. Splitting the dataset

2.1 Using train_test_split()

Split the data into training and test sets.

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

2.2 Parameter descriptions

Parameters:
- *arrays : sequence of indexables with same length / shape[0]
  
  Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- test_size : float or int, default=None
  
  If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
- train_size : float or int, default=None
  
  If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
- random_state : int, RandomState instance or None, default=None
  
  Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See Glossary.
- shuffle : bool, default=True
  
  Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
- stratify : array-like, default=None
  
  If not None, data is split in a stratified fashion, using this as the class labels. Read more in the User Guide.
Returns:
- splitting : list, length=2 * len(arrays)
  
  List containing train-test split of inputs. New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

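Summary:

train_size, test_size: the proportions (or absolute sizes) of the split
random_state: the seed that makes the shuffle reproducible
shuffle: whether to shuffle the data before splitting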

scikit-learn.org/stable/modu…

from sklearn.model_selection import train_test_split,cross_val_score

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=2023)
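To make the effect of test_size and random_state concrete, here is a small sketch on made-up arrays (20 samples, 2 features; the variable names are assumptions). It checks the resulting shapes and shows that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2023
)
# Same seed, same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2023
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, y_te2))
```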

3. Standardizing the data

class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

Parameters:

copy : bool, default=True

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean : bool, default=True

If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std : bool, default=True

If True, scale the data to unit variance (or equivalently, unit standard deviation).

Attributes:

scale_ : ndarray of shape (n_features,) or None

Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False. New in version 0.17: scale_
mean_ : ndarray of shape (n_features,) or None

The mean value for each feature in the training set. Equal to None when with_mean=False.
var_ : ndarray of shape (n_features,) or None

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
n_features_in_ : int

Number of features seen during fit. New in version 0.24.
feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings. New in version 1.0.
n_samples_seen_ : int or ndarray of shape (n_features,)

The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array of dtype int. If sample_weights are used it will be a float (if no missing data) or an array of dtype float that sums the weights seen so far. Will be reset on new calls to fit, but increments across partial_fit calls.

Methods

`fit`(X[, y, sample_weight])	Compute the mean and std to be used for later scaling.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(X[, copy])	Scale back the data to the original representation.
`partial_fit`(X[, y, sample_weight])	Online computation of mean and std on X for later scaling.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X[, copy])	Perform standardization by centering and scaling.

Reference: scikit-learn.org/stable/modu…

from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
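The code above fits the scaler on the training set and only applies it to the test set, so no test-set statistics leak into training. The sketch below, on made-up numbers, shows what that means: the training data ends up with zero mean and unit variance, while the test row is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three training rows and one test row
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Tree-based models such as LightGBM split on feature thresholds, so they are largely insensitive to this kind of rescaling; the step matters more for distance-based or gradient-based linear models.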

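4. Prediction with LightGBM

LGBMRegressor is a subclass of LGBMModel and is used for regression tasks. The signatures quoted in 4.1 and 4.2 come from an older LightGBM release; defaults and accepted arguments (for example n_estimators, silent, early_stopping_rounds, and verbose) may differ in the version you have installed.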

4.1 LGBMRegressor

class lightgbm.LGBMRegressor(boosting_type='gbdt', num_leaves=31, max_depth=-1, 
   learning_rate=0.1, n_estimators=10, max_bin=255, subsample_for_bin=200000,
   objective=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20,
   subsample=1.0, subsample_freq=1, colsample_bytree=1.0, reg_alpha=0.0,
   reg_lambda=0.0, random_state=None, n_jobs=-1, silent=True, **kwargs)

4.2 fit

fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None,
    eval_sample_weight=None, eval_init_score=None, eval_metric='l2',
    early_stopping_rounds=None, verbose=True, feature_name='auto',
    categorical_feature='auto', callbacks=None)

from lightgbm import LGBMRegressor
model = LGBMRegressor()
model.fit(X_train, y_train)

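5. Model evaluation

# Predict on the test data
y_pred = model.predict(X_test)
y_pred[0:5]

# Compare predicted and actual values
a = pd.DataFrame()  # create an empty DataFrame
a['预测值'] = list(y_pred)  # predicted values
a['实际值'] = list(y_test)  # actual values
a.head()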

6. Scoring

# Check the score; for a regressor, score() returns the R-squared on the given data
model.score(X_test, y_test)
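R-squared alone does not say how large the errors are in score points. As a sketch, the metrics below can be computed from any pair of actual and predicted arrays; the four values used here are made up, and in the article's setting they would be y_test and y_pred.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_hat)   # average absolute error
mse = mean_squared_error(y_true, y_hat)    # average squared error
rmse = np.sqrt(mse)                        # same unit as the target
r2 = r2_score(y_true, y_hat)               # share of variance explained

print(mae, mse, rmse, r2)
```

MAE and RMSE are expressed in the same unit as total_score, which makes them easier to interpret than R-squared when judging whether the predictions are usable.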

7. Hyperparameter tuning

GridSearchCV, as given in the scikit-learn documentation:

class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

The main methods are:

`decision_function`(X)	Call decision_function on the estimator with the best found parameters.
`fit`(X[, y, groups])	Run fit with all sets of parameters.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(Xt)	Call inverse_transform on the estimator with the best found params.
`predict`(X)	Call predict on the estimator with the best found parameters.
`predict_log_proba`(X)	Call predict_log_proba on the estimator with the best found parameters.
`predict_proba`(X)	Call predict_proba on the estimator with the best found parameters.
`score`(X[, y])	Return the score on the given data, if the estimator has been refit.
`score_samples`(X)	Call score_samples on the estimator with the best found parameters.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Call transform on the estimator with the best found parameters.

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV  # grid search over candidate hyperparameters
parameters = {'num_leaves': [15, 31, 62], 'n_estimators': [20, 30, 50, 70], 'learning_rate': [0.1, 0.2, 0.3, 0.4]}  # candidate values for the regressor's parameters
model = LGBMRegressor()  # build the model
grid_search = GridSearchCV(model, parameters, scoring='r2', cv=5)  # cv=5 means 5-fold cross-validation; scoring='r2' selects models by R-squared
grid_search.fit(X_train, y_train)  # run the search on the training data
grid_search.best_params_  # the best parameter combination found
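Besides best_params_, a fitted GridSearchCV exposes the cross-validated score of the winning combination and, with the default refit=True, a model already retrained on the full training set. The sketch below illustrates this on synthetic data. It swaps in scikit-learn's GradientBoostingRegressor for LGBMRegressor so that it runs without LightGBM installed; the data, the smaller grid, and cv=3 are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Stand-in estimator; the article uses LGBMRegressor here
param_grid = {"n_estimators": [20, 50], "learning_rate": [0.1, 0.3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, scoring="r2", cv=3)
search.fit(X_tr, y_tr)

best_params = search.best_params_                   # best combination on the training folds
cv_r2 = search.best_score_                          # mean cross-validated R^2 of that combination
test_r2 = search.best_estimator_.score(X_te, y_te)  # refit model evaluated on held-out data

print(best_params, cv_r2, test_r2)
```

Using best_estimator_ directly avoids copying the winning parameters by hand into a new model.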

8. Retraining

# Rebuild the LightGBM regression model
model = LGBMRegressor(num_leaves=15, n_estimators=50,learning_rate=0.1)
model.fit(X_train, y_train)

# Check the score
model.score(X_test, y_test)
