0%

西瓜书习题3-4

参考:https://blog.csdn.net/Snoopy_Yuan/article/details/64131129

完整代码:https://github.com/happybear1234/The-Watermelon-book-exercises/blob/master/Practical_3.4/code/Practical_3.4.py

习题3.4 选择两个UCI数据集,比较10折交叉验证法和留一法所估计出的对率回归的错误率

这里从UCI分别选择了数据集Iris Data Set Blood Transfusion Service Center Data Set;通过sklearn库实现,seaborns进行可视化,另外seaborns自带iris的数据集,可以直接拿来用

载入数据,预处理

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris') # 在线载入自带的 iris 数据集
X = iris.values[:, 0 : 4]
y = iris.values[:, 4]

sns.set(style='white') # 风格设置
g = sns.pairplot(iris, hue='species', markers=['o', 's', 'D']) # 变量关系组图
plt.show()

1

可以看到iris数据类间比较分散,也是后面测试结果比较好的原因之一

sklearn库

1
2
3
4
5
6
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import model_selection

log_model = LogisticRegression(max_iter=1000) # 增加最大迭代次数,也可以减少数据量
m, n = np.shape(X)

十折交叉训练

model_selection.cross_val_predict 指定模型直接就返回测试结果

1
2
3
4
5
y_pred_10_fold = model_selection.cross_val_predict(log_model, X, y, cv=10)

# 打印精度
accuracy_10_fold = metrics.accuracy_score(y, y_pred_10_fold)
print('The accuracy of 10-fold cross-validation:', accuracy_10_fold)
The accuracy of 10-fold cross-validation: 0.9733333333333334

留一法

留一法相当于k折交叉训练中,把k取为所有的样例数m,因此要经过m次训练,用循环来实现

1
2
3
4
5
6
7
8
9
10
accuracy_LOO = 0
# 计算 m 次测试的结果
for train_index, test_index in model_selection.LeaveOneOut().split(X):
X_train, X_test = X[train_index], X[test_index] # 训练集样本,测试集样本
y_train, y_test = y[train_index], y[test_index] # 训练集标签, 测试集标签
log_model.fit(X_train, y_train) # 训练模型
y_pred_LOO = log_model.predict(X_test) # 测试
if y_pred_LOO == y_test:
accuracy_LOO += 1
print('The accuracy of Leave-One-Out:', accuracy_LOO / m)
The accuracy of Leave-One-Out: 0.9666666666666667

对于iris数据集,精度比较高,相应错误率较低

类似的,对Transfusion数据集可视化:
2

类间分散比较紧凑

相应精度:
The accuracy of 10-fold cross-validation: 0.7687165775401069
The accuracy of Leave-One-Out: 0.7700534759358288

通过以上对比,十折交叉验证法与留一法精度相差不大;而且通过实验,留一法代码跑的时间更长,对于数据越大,这种现象越明显.
因此往后,选择十折交叉验证即可满足精度要求,也节约运行成本