Loading and preprocessing the data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris') # load seaborn's bundled iris dataset (fetched online)
X = iris.values[:, 0:4].astype(float) # four numeric feature columns
y = iris.values[:, 4] # species labels

sns.set(style='white') # plot style
g = sns.pairplot(iris, hue='species', markers=['o', 's', 'D']) # pairwise plots of the variables
plt.show()

The pair plot shows that the iris classes are well separated, which is one reason the test results below are good.

The sklearn library

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import model_selection

log_model = LogisticRegression(max_iter=1000) # raise the iteration limit (reducing the amount of data is another option)
m, n = np.shape(X)

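10-fold cross-validation

Given a model, model_selection.cross_val_predict directly returns the prediction made for each sample when it was in the test fold.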

y_pred_10_fold = model_selection.cross_val_predict(log_model, X, y, cv=10)

# Print the accuracy
accuracy_10_fold = metrics.accuracy_score(y, y_pred_10_fold)
print('The accuracy of 10-fold cross-validation:', accuracy_10_fold)

The accuracy of 10-fold cross-validation: 0.9733333333333334
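
To make clearer what cross_val_predict does internally, here is a minimal sketch that reproduces it with an explicit loop. It assumes that an integer cv with a classifier means an unshuffled StratifiedKFold, which is scikit-learn's documented default. The names X_demo, y_demo, manual_pred and auto_pred are illustrative, and load_iris from sklearn.datasets stands in for the seaborn copy of the data so that the snippet runs offline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Assumption: scikit-learn's bundled iris data replaces sns.load_dataset('iris')
X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Manual 10-fold loop: each sample is predicted by a model that never saw it
manual_pred = np.empty_like(y_demo)
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_demo[test_idx])

# The one-line equivalent used in this post
auto_pred = cross_val_predict(model, X_demo, y_demo, cv=10)

print('Same predictions:', np.array_equal(manual_pred, auto_pred))
print('Accuracy:', np.mean(manual_pred == y_demo))
```

The accuracy is computed over the pooled out-of-fold predictions, which is the same quantity that metrics.accuracy_score(y, y_pred_10_fold) reports above.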

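Leave-one-out

Leave-one-out is k-fold cross-validation with k set to the total number of samples m, so the model has to be trained m times. Here this is done with a loop.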

correct_LOO = 0 # number of held-out samples predicted correctly
# Run the m train/test rounds
for train_index, test_index in model_selection.LeaveOneOut().split(X):
    X_train, X_test = X[train_index], X[test_index] # training samples, test sample
    y_train, y_test = y[train_index], y[test_index] # training labels, test label
    log_model.fit(X_train, y_train) # train the model
    y_pred_LOO = log_model.predict(X_test) # predict the single held-out sample
    if y_pred_LOO[0] == y_test[0]:
        correct_LOO += 1
print('The accuracy of Leave-One-Out:', correct_LOO / m)

The accuracy of Leave-One-Out: 0.9666666666666667
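
As a check on the statement that leave-one-out is k-fold with k = m, the sketch below compares the splits produced by LeaveOneOut with those of an unshuffled KFold(n_splits=m), and also shows that the loop can be replaced by cross_val_predict with cv=LeaveOneOut(). The names X_demo, y_demo, m_demo and the use of load_iris are assumptions made for illustration, not part of the original experiment.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict

# Assumption: scikit-learn's bundled iris data replaces sns.load_dataset('iris')
X_demo, y_demo = load_iris(return_X_y=True)
m_demo = X_demo.shape[0]

# Collect (train, test) index lists from both splitters
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kfold_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=m_demo).split(X_demo)]
same_splits = loo_splits == kfold_splits
print('LeaveOneOut matches KFold(n_splits=m):', same_splits)

# Leave-one-out without a hand-written loop
loo_pred = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=LeaveOneOut())
loo_acc = accuracy_score(y_demo, loo_pred)
print('The accuracy of Leave-One-Out:', loo_acc)
```

The explicit loop remains useful when something extra has to happen in each round, but for a plain accuracy estimate the cross_val_predict form is shorter.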

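On the iris dataset the accuracy is high and the error rate correspondingly low.

Similarly, a visualization of the Transfusion dataset:

Here the classes are packed closely together, with far less separation between them.

Corresponding accuracy: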
The accuracy of 10-fold cross-validation: 0.7687165775401069
The accuracy of Leave-One-Out: 0.7700534759358288

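The comparison above shows that 10-fold cross-validation and leave-one-out give nearly the same accuracy. In these experiments the leave-one-out code also took longer to run, and the gap grows with the size of the dataset, since leave-one-out trains one model per sample.
From here on, 10-fold cross-validation is therefore enough: it meets the accuracy requirement and costs less to run.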
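
The running-time difference can be checked with a rough timing sketch like the one below. It is not the measurement used in this post: timed_cv is a hypothetical helper, load_iris stands in for the seaborn data, and absolute timings depend on the machine. The number of model fits, 10 versus m, is the part that does not vary.

```python
import time

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_predict

# Assumption: scikit-learn's bundled iris data replaces sns.load_dataset('iris')
X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)


def timed_cv(cv):
    """Hypothetical helper: return (seconds elapsed, accuracy) for one CV scheme."""
    start = time.perf_counter()
    pred = cross_val_predict(model, X_demo, y_demo, cv=cv)
    return time.perf_counter() - start, accuracy_score(y_demo, pred)


ten_fold = StratifiedKFold(n_splits=10)
loo = LeaveOneOut()

# Number of models each scheme has to train
n_fits_10 = ten_fold.get_n_splits(X_demo, y_demo)
n_fits_loo = loo.get_n_splits(X_demo)

t_10, acc_10 = timed_cv(ten_fold)
t_loo, acc_loo = timed_cv(loo)

print(f'10-fold:       {n_fits_10} fits, {t_10:.2f} s, accuracy {acc_10:.4f}')
print(f'Leave-one-out: {n_fits_loo} fits, {t_loo:.2f} s, accuracy {acc_loo:.4f}')
```

Because leave-one-out needs one fit per sample and each fit uses almost the whole dataset, its cost grows roughly quadratically with m for a fixed model, while 10-fold always needs ten fits.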