LaTeX Paper Template

Posted on 2020-06-15 | Category: Papers

\documentclass{ctexart}% Chinese document class
\pagestyle{empty}% no page header
\usepackage{graphicx}% figures
\begin{document}
% cover page
\begin{flushright}
	{\Large 分\hspace{2cm}数}：\underline{\quad\qquad\qquad}
	\vskip 0.5cm
	{\Large 任课教师签字}：\underline{\quad\qquad\qquad}
\end{flushright} 
\begin{center}
	\quad \\
	\quad \\
	\heiti\fontsize{30}{36}\selectfont 华\quad 北\quad 电\quad 力\quad 大\quad 学\quad 研\quad 究\quad 生\quad 结\quad 课\quad 作\quad 业
	\vskip 6cm
	%\heiti \zihao{2} 在此打印论文题目，二号黑体	
\end{center}
\vskip 3cm
\begin{quotation}
	\songti\fontsize{15}{15}\selectfont
	\par\setlength\parindent{8em}
	\quad 
	
	学\hspace{0.2cm} 年\hspace{0.2cm}学\hspace{0.2cm}期：\underline{{\Large 2019-2020}学年第二学期}
	\vskip 0.5cm
	课\hspace{0.2cm} 程\hspace{0.2cm}名\hspace{0.2cm}称：\underline{数据仓库与数据挖掘\qquad}
	\vskip 0.5cm
	学\hspace{0.2cm} 生\hspace{0.1cm} 姓\hspace{0.1cm} 名：\underline{\qquad\qquad\qquad 肖雄\qquad\qquad\qquad }
	\vskip 0.5cm
	学\hspace{1.7cm} 号：\underline{\quad\qquad {\Large 2192221067}\qquad\qquad\quad}
	\vskip 0.5cm
	提\hspace{0.3cm}交\hspace{0.3cm}时\hspace{0.2cm}间：\underline{\qquad {\Large 2020}年{\Large 06}月{\Large 15}日\qquad}
	\vskip 2cm
	\centering
\end{quotation}
\title{逻辑回归分类预测的分析与应用}
\date{}
\maketitle

%中文摘要
\begin{abstract}
	基于逻辑回归模型,对二分类问题进行分类预测;通过sklearn逻辑回归库与梯度下降算法实现做对比;并分别采用留一法与十折交叉验证法对Iris数据集和Blood Transfusion Service Center数据集进行分类;实验结果表明，留一法与十折交叉验证法精度相差不大，但十折交叉验证更加高效。
	\newline%另起一行
	
	\centering%使得关键字居中
	\textbf{关键字：}逻辑回归，梯度下降，留一法，十折交叉验证法
\end{abstract}
%英文摘要
\newcommand{\enabstractname}{Abstract}
\newenvironment{enabstract}{%
	\par\small
	\noindent\mbox{}\hfill{\bfseries \enabstractname}\hfill\mbox{}\par
	\vskip 2.5ex}{\par\vskip 2.5ex}  
\begin{enabstract}
	Based on the logistic regression model, this paper performs classification prediction for binary classification problems. An implementation using the sklearn logistic regression library is compared with a gradient descent implementation. Leave-one-out and ten-fold cross-validation are then each applied to the Iris data set and the Blood Transfusion Service Center data set. The experimental results show that the accuracy of leave-one-out differs little from that of ten-fold cross-validation, but ten-fold cross-validation is more efficient.
	
	\centering
	\textbf{Keywords:} logistic regression, gradient descent, leave-one-out, ten-fold cross-validation
\end{enabstract}

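\section{Introduction}
	With the rapid development of the information society, diversified information has become the dominant pattern. Data are described with more and more feature attributes, and a single record is often described by thousands of features. In data mining, many algorithmic models extract features from data and classify it; different evaluation methods place different demands on a model and yield different test results. The logistic regression model is efficient and easy to implement, and is therefore widely used. Based on the logistic regression model, this paper solves binary classification problems with both the sklearn logistic regression library and a gradient descent implementation, and compares evaluation by leave-one-out with evaluation by ten-fold cross-validation. The experimental results show that, at comparable accuracy, ten-fold cross-validation takes less time.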
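\section{Logistic Regression Model}
	Logistic regression is a commonly used machine learning method for estimating the likelihood of an event, for example the likelihood that a user buys a product, that a patient has a disease, or that an advertisement is clicked by a user\cite{1}. Logistic regression extends the idea of multiple linear regression to the case where the dependent variable is binary, with independent variables $x_{1},x_{2},x_{3},\ldots,x_{k}$. It measures the relationship between the classification outcome and the independent variables. The final output of the model is a 0/1 classification result, where 1 means that the sample belongs to the class and 0 means that it does not.
	\par{General linear model:}$$f(x)=\omega_{1}x_{1}+\omega_{2}x_{2}+\cdots+\omega_{d}x_{d}+b$$
	\par{where $x_{1},\ldots,x_{d}$ are the values of the $d$ feature attributes and $\omega_{1},\ldots,\omega_{d},b$ are the model parameters. In vector form: $$f(x)={\omega}x^T+b$$}
	\par{This can also be written as: $$f(x)={\beta}\hat{x}^T$$}
	\par{where $\beta=({\omega};b)$ and $\hat{x}=(x;1)$. Linear regression can only predict continuous values; for a discrete binary classification problem it has to be turned into logistic regression, that is, the linear output is mapped to a value between 0 and 1: $$\ln\frac{f(x)}{1-f(x)}={\beta}\hat{x}^T$$}
	\par{For this reason logistic regression is also called log-odds regression: $$f(x)=\frac{1}{1+e^{-\beta\hat{x}^T}}$$}
	\par{For the logistic regression model the key is how to find $\beta$; the most commonly used method is gradient descent.}
	% citation
	\cite{2}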
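\section{Gradient Descent}
	Gradient descent is one of the methods for solving unconstrained optimization problems. It is computationally simple and converges quickly at first, so it often serves as the core of other algorithms such as artificial neural networks and logistic regression, and it is widely used in data mining, pattern recognition and other fields\cite{3}.
	\par{Gradient descent is a first-order method. For the unconstrained optimization problem $\min_{x}f(x)$, suppose a sequence $x^{0},x^{1},x^{2},\ldots$ can be found such that $$f(x^{t+1})<f(x^{t}),\quad t=0,1,2,\ldots$$}
	\par{Repeating this step converges to a local minimum. By the first-order Taylor expansion: $$f(x+\Delta x)\approx f(x)+\Delta x^{T}\nabla f(x)$$}
	\par{So, to satisfy $f(x+\Delta x)<f(x)$, one can choose $$\Delta x=-\gamma\nabla f(x)$$}
	\par{where $\gamma$ is a small constant. This is gradient descent}\cite{4}.
	\par{In the logistic regression model, $\beta$ can be estimated by maximum likelihood\cite{5}, that is, by maximizing the log-likelihood: $$\psi(\beta)=\sum_{i=1}^{m}\ln\left(y_{i}\,p(y=1\mid\beta\hat{x}^T_{i})+(1-y_{i})\,p(y=0\mid\beta\hat{x}^T_{i})\right)=\sum_{i=1}^{m}\left(y_{i}\beta\hat{x}^T_{i}-\ln(1+e^{\beta\hat{x}^T_{i}})\right)$$}
	\par{Maximizing this expression is equivalent to minimizing its negative. Redefining $\psi$ as that negative gives a continuously differentiable convex function, which can be solved by gradient descent: $$\psi(\beta)=\sum_{i=1}^{m}\left(-y_{i}\beta\hat{x}^T_{i}+\ln(1+e^{\beta\hat{x}^T_{i}})\right)$$}
	\par{Iteration: $$\beta^{t+1}=\beta^{t}-\gamma\nabla \psi(\beta^{t})$$}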
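\section{Application}
	This paper uses the watermelon data set $3.0\alpha$ from \cite{2}; its distribution is shown in Figure~\ref{fig:xigua}.
	\begin{figure}[h]
		\centering
		\includegraphics[width=0.7\linewidth]{1}
		\caption{Scatter plot of the watermelon data set $3.0\alpha$\label{fig:xigua}}
		\label{fig:1}
	\end{figure}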
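	\par{The feature attributes are density and sugar content; the label ``1'' denotes a good melon and ``0'' a bad one. Using the hold-out method with the same training and test sets, the sklearn logistic regression library and batch gradient descent are each trained and tested. Gradient descent uses a fixed step size of 0.1 and levels off after 15000 iterations, as shown in Figure~\ref{fig:diedai}.}
	\begin{figure}
		\centering
		\includegraphics[width=0.7\linewidth]{2}
		\caption{Iteration curve of batch gradient descent\label{fig:diedai}}
		\label{fig:2}
	\end{figure}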
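	\par{In testing, the two methods reach nearly the same accuracy; the results are listed in the table below:}
	\par{}
	% table
	\begin{tabular}{|c|c|}
		\hline 
		Method&Accuracy  \\ 
		\hline 
		sklearn logistic regression library&67\%  \\ 
		\hline 
		Batch gradient descent&66.67\%  \\ 
		\hline 
	\end{tabular} 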
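	\par{Because this data set is too small, the hold-out method gives only a mediocre fit. The Iris data set and the Blood Transfusion Service Center data set were therefore selected from UCI, and leave-one-out and ten-fold cross-validation were compared using the same logistic regression model. The results are as follows:}
	\par{}
	\begin{tabular}{|c|c|c|}
		\hline 
		Data set&Leave-one-out&Ten-fold cross-validation  \\ 
		\hline 
		Iris&96.67\%&97.33\%\\ 
		\hline 
		Blood Transfusion Service Center&77.01\%&76.87\%\\ 
		\hline 
	\end{tabular} 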
	\par{The classes in the widely cited Iris data set are well separated, so the fit is better than on the Blood Transfusion Service Center data set, as Figures~\ref{fig:i} and~\ref{fig:t} show:}
	\begin{figure}
		\centering
		\includegraphics[width=0.7\linewidth]{i}
		\caption{Scatter plot of the Iris data set\label{fig:i}}
	\end{figure}
	\begin{figure}
		\centering
		\includegraphics[width=0.7\linewidth]{t}
		\caption{Scatter plot of the Blood Transfusion Service Center data set\label{fig:t}}
	\end{figure}
	
	
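	\par{The experiments show that leave-one-out and ten-fold cross-validation differ little in accuracy. Notably, ten-fold cross-validation is more efficient and takes less time, and the gap widens as the data set grows. Choosing ten-fold cross-validation therefore both meets the accuracy requirement and reduces the running cost.}
\section{Conclusion}
	For binary classification, logistic regression not only maps the prediction to a value between 0 and 1 but also yields a probability estimate, which gives it wide application in many fields. A good model is usually judged by its test results, so the evaluation method that splits the data into training and test sets plays a decisive role. This paper compares the evaluation results of leave-one-out and ten-fold cross-validation; the results show that, when enough data are available, ten-fold cross-validation is the more efficient choice.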
	
	
\begin{thebibliography}{9}
	\bibitem{1} 毛林, 陆全华, 程涛. \emph{基于高维数据的集成逻辑回归分类算法的研究与应用}[J]. 科技通报, 2013, 29(12): 64--66.
	\bibitem{2} 周志华. \emph{机器学习}[M]. 2016: 53--59.
	\bibitem{3} 郭跃东, 宋旭东. \emph{梯度下降法的分析和改进}[J]. 科技展望, 2016, 26(15): 115+117.
	\bibitem{4} 周志华. \emph{机器学习}[M]. 2016: 407--408.
	\bibitem{5} 周志华. \emph{机器学习}[M]. 2016: 59--60.
\end{thebibliography}

\end{document}

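Watermelon Book Exercise 3-4

Posted on 2020-05-24 | Category: Machine Learning

Reference: https://blog.csdn.net/Snoopy_Yuan/article/details/64131129

Full code: https://github.com/happybear1234/The-Watermelon-book-exercises/blob/master/Practical_3.4/code/Practical_3.4.py

Exercise 3.4: Choose two UCI data sets and compare the error rates of logistic regression estimated by 10-fold cross-validation and by leave-one-out.

Two UCI data sets are used here: the Iris Data Set and the Blood Transfusion Service Center Data Set. The implementation uses sklearn, with seaborn for visualization; seaborn also ships the iris data set, so it can be used directly.

Load and preprocess the data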

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

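iris = sns.load_dataset('iris') # load the iris data set bundled with seaborn (fetched online)
X = iris.values[:, 0 : 4]
y = iris.values[:, 4]

sns.set(style='white') # plot style
g = sns.pairplot(iris, hue='species', markers=['o', 's', 'D']) # pairwise relationship plots
plt.show()

The iris classes are well separated, which is one reason the later test results are good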

sklearn

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import model_selection

log_model = LogisticRegression(max_iter=1000) # raise the iteration cap so the solver converges; using less data would also work
m, n = np.shape(X)

10-fold cross-validation

Given a model, model_selection.cross_val_predict returns the test predictions directly

y_pred_10_fold = model_selection.cross_val_predict(log_model, X, y, cv=10)

# print the accuracy
accuracy_10_fold = metrics.accuracy_score(y, y_pred_10_fold)
print('The accuracy of 10-fold cross-validation:', accuracy_10_fold)

The accuracy of 10-fold cross-validation: 0.9733333333333334

Leave-one-out

Leave-one-out is k-fold cross-validation with k set to the number of samples m, so it needs m rounds of training; a loop implements this

accuracy_LOO = 0
# accumulate the results of the m tests
for train_index, test_index in model_selection.LeaveOneOut().split(X):
    X_train, X_test = X[train_index], X[test_index] # training samples, test sample
    y_train, y_test = y[train_index], y[test_index] # training labels, test label
    log_model.fit(X_train, y_train) # train the model
    y_pred_LOO = log_model.predict(X_test) # test
    if y_pred_LOO == y_test:
        accuracy_LOO += 1
print('The accuracy of Leave-One-Out:', accuracy_LOO / m)

The accuracy of Leave-One-Out: 0.9666666666666667
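
The relationship between the two schemes can be sketched without sklearn. The helper below, `kfold_splits`, is a hypothetical illustration (it is not part of the post's code, and it uses plain contiguous folds rather than the stratified folds sklearn applies to classifiers). It shows that leave-one-out is the k = m case, and why it costs m model fits instead of 10.

```python
def kfold_splits(m, k):
    """Yield (train_indices, test_indices) for k contiguous folds over m samples."""
    # Fold sizes differ by at most one when k does not divide m.
    base, extra = divmod(m, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, m))
        yield train, test
        start += size


m = 150  # number of samples in iris
ten_fold = list(kfold_splits(m, 10))
leave_one_out = list(kfold_splits(m, m))  # k = m: every fold holds one sample

print(len(ten_fold), len(ten_fold[0][1]))            # 10 fits, 15 test samples each
print(len(leave_one_out), len(leave_one_out[0][1]))  # 150 fits, 1 test sample each
```

Each split means one call to `fit`, so the running time of leave-one-out grows with the number of samples while 10-fold stays at ten fits.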

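On the iris data set the accuracy is high and the error rate correspondingly low

Similarly, the Transfusion data set can be visualized:

Its classes are packed closely together

Corresponding accuracy:
The accuracy of 10-fold cross-validation: 0.7687165775401069
The accuracy of Leave-One-Out: 0.7700534759358288

The comparison above shows that 10-fold cross-validation and leave-one-out differ little in accuracy. In the experiments the leave-one-out code also took longer to run, and the gap widens as the data set grows.
From now on, 10-fold cross-validation is therefore enough to meet the accuracy requirement while saving running cost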

Watermelon Book Exercise 3-3

Posted on 2020-05-23 | Updated on 2020-05-25 | Category: Machine Learning

References:
[1]https://www.cnblogs.com/judejie/p/8999832.html
[2]https://blog.csdn.net/zouxy09/article/details/20319673
[3]https://blog.csdn.net/Snoopy_Yuan/article/details/63684219?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.nonecase

Full code:
(sklearn implementation)https://github.com/happybear1234/The-Watermelon-book-exercises/blob/master/Practical_3.3/code/Practical_3.3.py
(gradient descent implementation)https://github.com/happybear1234/The-Watermelon-book-exercises/blob/master/Practical_3.3/code/Practical_3.3_self_def.py

Exercise 3.3: Implement logistic regression in code and report the results on the watermelon data set below.

Density	Sugar content	Good melon
0.697	0.46	yes
0.774	0.376	yes
0.634	0.264	yes
0.608	0.318	yes
0.556	0.215	yes
0.403	0.237	yes
0.481	0.149	yes
0.437	0.211	yes
0.666	0.091	no
0.243	0.267	no
0.245	0.057	no
0.343	0.099	no
0.639	0.161	no
0.657	0.198	no
0.36	0.37	no
0.593	0.042	no
0.719	0.103	no

In the data above, good melons are encoded as "1" and bad melons as "0", and the table is saved as a csv file for easy loading

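Formulas

Basic linear model:
$$f(x)=\omega_{1}x_{1}+\omega_{2}x_{2}+\cdots+\omega_{d}x_{d}+b$$
In vector form:
$$f(x)={\omega}x^T+b$$
Let $\beta=({\omega},b)$ and $\hat{x}=(x,1)$ (note: both are row vectors here, the opposite of the Watermelon Book); then
$$f(x)={\omega}x^T+b={\beta}\hat{x}^T$$
To handle the binary classification problem here, the prediction has to be mapped to a label $y\in\{0,1\}$, that is, linear regression is turned into logistic regression. The log-odds (sigmoid) function is commonly used for this:
$$y=\frac{1}{1+e^{-\beta\hat{x}^T}}$$
So only $\omega$ and $b$ need to be found. They are estimated by maximum likelihood:
$$\psi(\beta)=\sum_{i=1}^{m}\left(y_{i}\beta\hat{x}^T_{i}-\ln(1+e^{\beta\hat{x}^T_{i}})\right)$$
Turning this maximization into a minimization, which suits gradient descent, gives Eq. 3.27 on p. 59 of the Watermelon Book:
$$\psi(\beta)=\sum_{i=1}^{m}\left(-y_{i}\beta\hat{x}^T_{i}+\ln(1+e^{\beta\hat{x}^T_{i}})\right)$$

This function is continuously differentiable and convex (its Hessian is positive definite), so gradient descent can find the optimum. The gradient is:
$$\frac{\partial\psi(\beta)}{\partial\beta}=\sum_{i=1}^{m}\left(-y_{i}+\frac{1}{1+e^{-\beta\hat{x}^T_{i}}}\right)\hat{x}_{i}$$
The iteration (with step size $\lambda$) is:
$$\beta^{t+1}=\beta^{t}-\lambda\frac{\partial\psi(\beta^{t})}{\partial\beta}$$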
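
As a cross-check of the formulas above, here is a minimal sketch in plain Python (no NumPy) of the cost, its gradient and the update rule, run on the 17 samples listed in the exercise. The function names `cost`, `gradient` and `gradient_descent` are illustrative and are not the ones used in the linked repository; the step size 0.1 and the 1500 iterations follow the settings quoted later in this post.

```python
import math

# Watermelon data set from the exercise: (density, sugar content, label).
DATA = [
    (0.697, 0.460, 1), (0.774, 0.376, 1), (0.634, 0.264, 1), (0.608, 0.318, 1),
    (0.556, 0.215, 1), (0.403, 0.237, 1), (0.481, 0.149, 1), (0.437, 0.211, 1),
    (0.666, 0.091, 0), (0.243, 0.267, 0), (0.245, 0.057, 0), (0.343, 0.099, 0),
    (0.639, 0.161, 0), (0.657, 0.198, 0), (0.360, 0.370, 0), (0.593, 0.042, 0),
    (0.719, 0.103, 0),
]
X_HAT = [(d, s, 1.0) for d, s, _ in DATA]  # x_hat = (x, 1)
Y = [label for _, _, label in DATA]


def dot(beta, x_hat):
    return sum(b * x for b, x in zip(beta, x_hat))


def cost(beta):
    """psi(beta) = sum_i (-y_i * beta.x_i + ln(1 + exp(beta.x_i)))."""
    return sum(-y * dot(beta, x) + math.log1p(math.exp(dot(beta, x)))
               for x, y in zip(X_HAT, Y))


def gradient(beta):
    """d psi / d beta = sum_i (sigmoid(beta.x_i) - y_i) * x_i."""
    grad = [0.0] * len(beta)
    for x, y in zip(X_HAT, Y):
        residual = 1.0 / (1.0 + math.exp(-dot(beta, x))) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj
    return grad


def gradient_descent(step=0.1, iterations=1500):
    beta = [0.0, 0.0, 0.0]
    history = [cost(beta)]
    for _ in range(iterations):
        beta = [b - step * g for b, g in zip(beta, gradient(beta))]
        history.append(cost(beta))
    return beta, history


beta, history = gradient_descent()
print('cost: %.4f -> %.4f' % (history[0], history[-1]))
```

With beta initialised to zero every sample contributes ln 2 to the cost, so the curve starts at 17 ln 2 and should only move downward from there.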

sklearn implementation

Load and preprocess the data

import numpy as np
import matplotlib.pyplot as plt


dataset = np.loadtxt('/home/data/watermelon_3a.csv', delimiter=',')

X = dataset[:, 1 : 3]
y = dataset[:, 3]
print(np.shape(X))

(17, 2)

Draw a scatter plot to inspect how the data are distributed:

f1 = plt.figure(1)
plt.title('watermelon_3a')
plt.xlabel('density')
plt.ylabel('rate_sugar')
plt.scatter(X[y == 0, 0], X[y == 0, 1], marker = 'o', color = 'k', s = 100, label= 'bad')
plt.scatter(X[y == 1, 0], X[y ==1, 1], marker= 'o', color = 'g', s = 100, label = 'good')
plt.legend(loc = 'upper right')
plt.show()

Fitting with the sklearn logistic regression library

Train and predict with the logistic regression model in sklearn

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pylab as pl

# Split the data set with the hold-out method; returns the training/test samples and the training/test labels
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=0)

# Train the model
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

# Test the model
y_pred = log_model.predict(X_test)

Print the confusion matrix and related metrics; the results are as follows:

print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred)

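[[3 2]
 [1 3]]

              precision    recall  f1-score   support

         0.0       0.75      0.60      0.67         5
         1.0       0.60      0.75      0.67         4

    accuracy                           0.67         9
   macro avg       0.68      0.68      0.67         9
weighted avg       0.68      0.67      0.67         9

The samples were drawn with the hold-out method. Because there are so few samples the fit is mediocre and the prediction accuracy is only 67%; bootstrapping or cross-validation could be used to resample and then select a better model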
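
The figures in the report follow directly from the confusion matrix. The short sketch below recomputes them by hand as a worked example; the helper names are illustrative, and the only assumption is sklearn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predicted labels.
cm = [[3, 2],
      [1, 3]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(label):
    predicted_as_label = cm[0][label] + cm[1][label]  # column sum
    return cm[label][label] / predicted_as_label


def recall(label):
    truly_label = sum(cm[label])  # row sum
    return cm[label][label] / truly_label


def f1(label):
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r)


for label in (0, 1):
    print(label, round(precision(label), 2), round(recall(label), 2), round(f1(label), 2))
print('accuracy', round(accuracy, 2))
```

Six of the nine test samples sit on the diagonal, which is where the 67% accuracy comes from.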

Plot the decision boundary

f2 = plt.figure(2)
h = 0.01
x0_min, x0_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
x1_min, x1_max = X[:, 1].min() - 0.1, X[:, 1].max() +0.1
x0, x1= np.meshgrid(np.arange(x0_min, x0_max, h), np.arange(x1_min, x1_max, h)) # coordinate matrices of the Cartesian grid

z = log_model.predict(np.c_[x0.ravel(), x1.ravel()]) # c_ stacks columns, ravel flattens to 1-D

z = z.reshape(x0.shape)
plt.contourf(x0, x1, z, cmap = pl.cm.Paired)# filled contours

plt.title('watermelon_3a')
plt.xlabel('density')
plt.ylabel('rate_sugar')
plt.scatter(X[y==0, 0], X[y==0, 1], marker='o', color='k', s=100, label='bad')
plt.scatter(X[y==1, 0], X[y==1, 1], marker='o', color='g', s=100, label='good')
plt.legend(loc='upper right')
plt.show()

The trained classifier still classifies most of the samples correctly

Gradient descent implementation

Implement the formulas above

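# 1) Eq. 3.27 on p. 59: (negative) log-likelihood

def likelihood_sub(x, y, beta):
    """
    :param x: one sample (row vector)
    :param y: label of that sample
    :param beta: parameter vector of Eq. 3.27 (row vector)
    :return: this sample's term of Eq. 3.27
    """
    # np.math is deprecated and absent from recent NumPy releases;
    # reduce the dot product to a Python float and use np.log / np.exp instead
    z = np.asarray(np.dot(beta, x.T)).item()
    return -y * z + np.log(1 + np.exp(z))

def likelihood(X, y, beta):
    """
    Eq. 3.27: negative log-likelihood (cross-entropy loss)
    :param X: sample matrix
    :param y: label array
    :param beta: parameter vector of Eq. 3.27
    :return: value of Eq. 3.27 at beta
    """
    total = 0
    m, n = np.shape(X)

    for i in range(m):
        total += likelihood_sub(X[i], y[i], beta)
    return total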

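# 2) First-order partial derivative of the likelihood
def sigmoid(x, beta):
    """
    Basic model: the S-shaped function
    Eq. 3.23 on p. 59, log-odds regression (logistic regression)
    :param x: one sample
    :param beta: parameter vector
    :return: value of the S-shaped function
    """
    # reduce the dot product to a Python float (np.math is deprecated)
    z = np.asarray(np.dot(beta, x.T)).item()
    return 1 / (1 + np.exp(-z))
def partial_derivative(X, y, beta):
    """
    Eq. 3.30 on p. 60: first-order partial derivative of the likelihood
    :param X: sample matrix
    :param y: label array
    :param beta: parameter vector of Eq. 3.27
    :return: gradient with respect to beta
    """
    m, n = np.shape(X)
    pd = np.zeros(n)

    for i in range(m):
        tmp = -y[i] + sigmoid(X[i], beta)
        for j in range(n):
            pd[j] += X[i][j] * tmp
    return pd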

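Batch gradient descent is used here:

def gradDscent(X, y, alpha, iterations, n):
    """
    :param X: sample matrix
    :param y: label array
    :param alpha: step size
    :param iterations: number of iterations
    :param n: number of columns of X (length of beta)
    :return: fitted beta of Eq. 3.27 and the cost after each iteration
    """
    cost = np.zeros(iterations) # array of `iterations` zeros
    beta = np.asmatrix(np.zeros(n)) # initialise beta as a 1 x n matrix (np.mat is an alias that newer NumPy no longer provides)

    for i in range(iterations):
        # gradient descent step
        output = partial_derivative(X, y, beta)
        beta = beta - alpha * output
        cost[i] = likelihood(X, y, beta)

    return beta, cost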

Plot the convergence curve

def showConvergCurve(Iterations, Cost):
    """
    :param Iterations: number of iterations
    :param Cost: array of loss values
    """
    f1 = plt.figure(1)
    t = np.arange(Iterations)
    p1 = plt.subplot(1,1,1)
    p1.plot(t, Cost, 'r')
    p1.set_xlabel('Iterations')
    p1.set_ylabel('cost')
    p1.set_title('The Gradient Descent Convergence Curve')

    plt.show()

The step size is 0.1 and the number of iterations is 1500; the curve levels off after about 800 iterations

Plot the decision boundary

def showLogRegression(X, y, Beta, N):
    f2 = plt.figure(2)

    plt.title('The Logistic Regression Fitted Curve')
    plt.xlabel('density')
    plt.ylabel('rate_sugar')
    # f = Beta * X.transpose()
    # plt.plot(X[:, 2], f.tolist()[0], 'r', label = 'Prediction')
    min_x = min(X[:, 0])
    max_x = max(X[:, 0])
    y_min_x = (- Beta.tolist()[0][2] - Beta.tolist()[0][0] * min_x) / Beta.tolist()[0][1] # decision boundary w1 * x1 + w2 * x2 + b = 0, solved for x2
    y_max_x = (- Beta.tolist()[0][2] - Beta.tolist()[0][0] * max_x) / Beta.tolist()[0][1]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='o', color='k', s=100, label='bad')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='o', color='g', s=100, label='good')
    plt.legend(loc='upper right')
    plt.show()

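The fit looks worse than that of the sklearn logistic regression classifier, yet the computed accuracy is higher (for convenience the data set was not split here; the comparison could be repeated on the split produced with sklearn)

Testing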

def testLogRegres(Beta, test_x, test_y):
    m, n = np.shape(test_x)
    matchCount = 0
    for i in range(m):
        predict = sigmoid(test_x[i], Beta) > 0.5
        if predict == bool(test_y[i]):
            matchCount += 1
    accuracy = float(matchCount) / m
    return accuracy

def loadData():
    dataset = np.loadtxt('/home/data/watermelon_3a.csv', delimiter=',')

    X = dataset[:, 1: 3]
    tmp = np.ones(X.shape[0])
    X = np.insert(X, 2, values=tmp, axis=1) # append a column of ones as the last column
    y = dataset[:, 3]
    return X, y

def main():
    alpha = 0.1  # step size
    iterations = 1500  # maximum number of iterations
    X, y = loadData()
    test_x = X
    test_y = y
    m, n = np.shape(X)
    beta, cost = gradDscent(X, y, alpha, iterations, n)
    print(beta)
    showConvergCurve(iterations, cost)
    showLogRegression(X, y, beta, n)
    accuracy = testLogRegres(beta, test_x, test_y)
    print('The classify accuracy is: %.3f%%' %(accuracy * 100))

if __name__ == '__main__':
    main()

The classify accuracy is: 70.588%

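Watermelon Book Exercise 1.2

Posted on 2020-04-28 | Category: Machine Learning

Reference: https://blog.csdn.net/yuzeyuan12/article/details/83113461

1.2 Compared with representing a hypothesis by a single conjunction, using a disjunctive normal form gives the hypothesis space greater expressive power. If disjunctive normal forms containing at most k conjunctions are used to express the hypothesis space of the watermelon classification problem in Table 1.1, estimate how many hypotheses are possible.

Table 1.1 shows two values for color, three for root and three for sound. There are therefore 3*4*4 = 48 conjunctions (excluding the empty set) and 2*3*3 = 18 concrete feature combinations.
The key to this exercise is redundancy removal. Because there are at most 18 feature combinations, a disjunctive normal form covers at most 18 of them once redundancy is removed, so an 18-dimensional 0/1 vector can represent every possible outcome. All ones is the largest case and all zeros is the initial value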
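
The encoding described above can be sketched with integer bitmasks instead of NumPy arrays. This is an illustrative variant, not the code from this post: `conjunction_mask` and `count_hypotheses` are hypothetical names, and each of the 18 concrete melons is assigned one bit, so redundancy removal becomes a bitwise OR.

```python
from itertools import combinations, product

SIZES = (2, 3, 3)  # color, root, sound: number of concrete values per attribute


def conjunction_mask(choice):
    """Return the 18-bit mask of concrete melons covered by one conjunction.

    choice[a] in range(SIZES[a]) selects one value; choice[a] == SIZES[a] means '*'.
    """
    options = [range(n) if c == n else [c] for c, n in zip(choice, SIZES)]
    mask = 0
    for color, root, sound in product(*options):
        mask |= 1 << (color * 9 + root * 3 + sound)
    return mask


# 3 * 4 * 4 = 48 conjunctions (wildcards included, empty hypothesis excluded).
MASKS = [conjunction_mask(c) for c in product(range(3), range(4), range(4))]


def count_hypotheses(k):
    """Count the distinct unions of exactly k different conjunctions."""
    seen = set()
    for combo in combinations(MASKS, k):
        union = 0
        for mask in combo:
            union |= mask  # redundancy removal: overlapping coverage collapses
        seen.add(union)
    return len(seen)


print(count_hypotheses(1), count_hypotheses(2))
```

Two different sets of conjunctions that cover the same melons produce the same integer, so the size of `seen` is the number of distinct hypotheses.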

Code:

import numpy as np
import itertools as it
import datetime

""""
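1) Draw every combination sample_combin of k values from 0-47
2) Convert each element of sample_combin to its 3-D coordinate coord_3 in (3,4,4)
3) Convert each coord_3 to an 18-dimensional 0/1 vector vector_18 and append it to the list vector for redundancy removal
4) Map vector to a number num in 1 to 2^18-1 and append it to num_set to filter out duplicates
5) The final length of num_set is the result
"""
# Convert a number to a 3-D coordinate in (3,4,4)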
def turn_48_to_coord_3(num):
    for i in range(3):
        for j in range(4):
            for k in range(4):
                if i * 16 + j * 4 + k == num:
                    return [i + 1,j + 1,k + 1]

# Convert a 3-D coordinate in (3,4,4) to an 18-dimensional vector
def coord_3_to_18(coord_3):
    vector_18 = np.zeros([2,3,3])
    # color is *
    if coord_3[0] == 3:
        coord_3[0] = [1, 2]
    else:
        coord_3[0] = [coord_3[0]]
    # root is *
    if coord_3[1] == 4:
        coord_3[1] = [1, 2, 3]
    else:
        coord_3[1] = [coord_3[1]]
    # sound is *
    if coord_3[2] == 4:
        coord_3[2] = [1, 2, 3]
    else:
        coord_3[2] = [coord_3[2]]
    for x in coord_3[0]:
        for y in coord_3[1]:
            for z in coord_3[2]:
                # a 1 in the 18-dimensional vector marks a covered feature combination
                vector_18[x-1][y-1][z-1] = 1
    return vector_18


# Convert a number in 0-47 to its 18-dimensional vector
def get_48_to_18(num):
    coord_3 = turn_48_to_coord_3(num)
    vector_18 = coord_3_to_18(coord_3)
    return vector_18
def main(k):
    num_set = []
    # draw every combination of k values from 0-47
    for sample_combin in it.combinations(range(48),k):
        vector = []
        for i in range(k):
            vector_18 = get_48_to_18(sample_combin[i])
            vector.append(vector_18)
        vector = np.array(vector)
        vector = vector.any(axis=0) # redundancy removal: logical OR along the first axis
        vector = np.reshape(vector,[18])
        vector = vector.tolist()
        num = 0
        for i in range(18):
            num += 2 ** i * vector[i] # map the 18-dimensional 0/1 vector to a decimal number in 1 to 2^18-1
        num_set.append(num)
        if len(num_set) > 5000000:
            num_set = list(set(num_set)) # deduplicate once the list exceeds 5 million entries so that it cannot grow until the program crashes
    # deduplicate one last time
    num_set = list(set(num_set))
    end_time1 = datetime.datetime.now()
    print('k=%d时： %d examples' %(k, len(num_set)))
    print('   用时:', end_time1 - start_time1)

start_time0 = datetime.datetime.now()
for k in range(1,18):
    start_time1 = datetime.datetime.now()
    main(k)
end_time0 = datetime.datetime.now()
print('一共用时',end_time0 - start_time0)

Run results:
k=1时： 48 examples
   用时: 0:00:00.001208
k=2时： 879 examples
   用时: 0:00:00.034517
k=3时： 8223 examples
   用时: 0:00:00.668103
k=4时： 40911 examples
   用时: 0:00:09.143752
k=5时： 112962 examples
   用时: 0:01:35.084796
k=6时： 193998 examples
   用时: 0:13:07.760253
k=7时： 233640 examples
   用时: 1:28:37.023360

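Because this is exhaustive search, the number of combinations enumerated for sizes 1 to k before redundancy removal is $\sum_{i=1}^{k}C_{48}^{i}$, which approaches $\sum_{i=0}^{48}C_{48}^{i}=(1+1)^{48}=2^{48}$ (binomial theorem). The amount of computation therefore grows quickly with k, as the running times show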
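
The growth of the search space can be checked with the standard library. The sketch below is only an illustration of the binomial sums; `enumerated` is a hypothetical helper name and N = 48 is the number of conjunctions from the exercise.

```python
from math import comb

N = 48  # number of conjunctions


def enumerated(k):
    """Number of combinations of sizes 1..k visited without redundancy removal."""
    return sum(comb(N, i) for i in range(1, k + 1))


for k in (1, 2, 3, 7):
    print(k, comb(N, k), enumerated(k))
```

Summing over every size from 1 to 48 gives 2^48 - 1, which is why the loop in the post becomes impractical well before k reaches 18.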

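Datawhale Data Mining for Beginners - Task5

Posted on 2020-04-04 | Category: Data Mining and Machine Learning

Model fusion across several tuned models

Simple weighted fusion:

Regression (class probabilities): arithmetic mean fusion, geometric mean fusion;
Classification: voting
Combined: rank averaging, log fusion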
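
The simple fusion rules listed above can be sketched in a few lines of plain Python. The function names are illustrative, the numeric inputs reuse the toy predictions that appear later in this post, and the class labels passed to `majority_vote` are made up for the example.

```python
import math


def arithmetic_mean(*preds):
    return [sum(values) / len(values) for values in zip(*preds)]


def geometric_mean(*preds):
    # Only defined for strictly positive predictions.
    return [math.exp(sum(math.log(v) for v in values) / len(values))
            for values in zip(*preds)]


def ranks(pred):
    """Rank of each sample within one model's predictions (1 = smallest)."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    result = [0] * len(pred)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def rank_average(*preds):
    return arithmetic_mean(*[ranks(p) for p in preds])


def majority_vote(*labels):
    return [max(set(values), key=values.count) for values in zip(*labels)]


pre1 = [1.2, 3.2, 2.1, 6.2]
pre2 = [0.9, 3.1, 2.0, 5.9]
pre3 = [1.1, 2.9, 2.2, 6.0]
print(arithmetic_mean(pre1, pre2, pre3))
print(geometric_mean(pre1, pre2, pre3))
print(rank_average(pre1, pre2, pre3))
print(majority_vote([0, 1, 1], [0, 1, 0], [1, 1, 0]))  # made-up class labels
```

Rank averaging discards the scale of each model's output and keeps only the ordering, which is why it is usually paired with ranking metrics rather than MAE.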

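stacking/blending:

Build multi-layer models and fit again on the predictions of the earlier layers.

boosting/bagging (already used in xgboost, Adaboost and GBDT):

Boosting methods over multiple trees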

import numpy as np
import pandas as pd

## Weighted average of the predictions
def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result

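from sklearn import metrics

# Toy data used below: test_pre_i is the i-th model's prediction, y_test_true the ground truth
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]
y_test_true = [1, 3, 2, 6]

# MAE of each model's predictions
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true, test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true, test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true, test_pre3))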

## MAE of the weighted fusion
w = [0.3,0.4,0.3] # weights of the three models
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true, Weighted_pre))

## Mean fusion of the predictions
def Mean_method(test_pre1,test_pre2,test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
    return Mean_result

Mean_pre = Mean_method(test_pre1,test_pre2,test_pre3)
print('Mean_pre MAE:',metrics.mean_absolute_error(y_test_true, Mean_pre))

## Median fusion of the predictions
def Median_method(test_pre1,test_pre2,test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).median(axis=1)
    return Median_result

Median_pre = Median_method(test_pre1,test_pre2,test_pre3)
print('Median_pre MAE:',metrics.mean_absolute_error(y_test_true, Median_pre))

from sklearn import linear_model

def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2= linear_model.LinearRegression()):
    model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
    return Stacking_result

## Toy sample data: train_reg_i / test_pre_i are the i-th model's predictions on the training / test set
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
# y_train_true holds the true values of the training set
y_train_true = [3, 8, 9, 5] 

test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true holds the true values of the test set
y_test_true = [1, 3, 2, 6] 

model_L2= linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
                               test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))

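Datawhale Data Mining for Beginners - Task4

Posted on 2020-04-01 | Updated on 2020-04-03 | Category: Data Mining and Machine Learning

Get to know the common machine learning models and master the modelling and tuning workflow

Load the data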

import numpy as np
import pandas as pd

# reduce_mem_usage shrinks the memory footprint of a DataFrame by downcasting column dtypes
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2 # memory_usage() reports bytes; convert to MB
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv("data_for_tree.csv"))

Memory usage of dataframe is 59.22 MB
Memory usage after optimization is: 15.75 MB
Decreased by 73.4%

# Feature columns: every column of sample_feature except 'price', 'brand' and 'model'
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]

Linear regression & 5-fold cross-validation & simulating the real business scenario

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

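Simple modelling

from sklearn.linear_model import LinearRegression

# note: the normalize argument was removed in scikit-learn 1.2; drop it on newer versions
model = LinearRegression(normalize=True)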

model = model.fit(train_X, train_y)

# Inspect the intercept and the weights (coef) of the fitted linear regression model
print('intercept:'+ str(model.intercept_))

print(sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True))

intercept:-110670.68277241504

[('v_6', 3367064.3416418717), 
('v_8', 700675.5609398251),
 ('v_9', 170630.27723215616), 
('v_7', 32322.661931980558), 
('v_12', 20473.670796988994), 
('v_3', 17868.07954151303), 
('v_11', 11474.9389967116), 
('v_13', 11261.764560019501), 
('v_10', 2683.9200906064084), 
('gearbox', 881.8225039250154), 
('fuelType', 363.9042507216036), 
('bodyType', 189.60271012073036), 
('city', 44.94975120522736), 
('power', 28.553901616752857), 
('brand_price_median', 0.5103728134078609), 
('brand_price_std', 0.4503634709263256), 
('brand_amount', 0.14881120395065583), 
('brand_price_max', 0.0031910186703138638), 
('SaleID', 5.355989919860593e-05), 
('offerType', 4.397239536046982e-06), 
('train', 2.7939677238464355e-07), 
('seller', -2.873130142688751e-07), 
('brand_price_sum', -2.175006868187596e-05), 
('name', -0.0002980012713074109), 
('used_time', -0.002515894332880479), 
('brand_price_average', -0.404904845101148), 
('brand_price_min', -2.2467753486888244), 
('power_bin', -34.42064411727887), 
('v_14', -274.7841180777388), 
('kilometer', -372.89752666073025), 
('notRepairedDamage', -495.1903844628239), 
('v_0', -2045.0549573558887), 
('v_5', -11022.98624082137), 
('v_4', -15121.731109860013), 
('v_2', -26098.29992055148), 
('v_1', -45556.18929726381)]

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
# Scatter plot of feature v_9 against the label: the predictions (blue points) are distributed quite differently from the true labels (black points),
# and some predictions fall below 0, so the model has problems
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from true price')
plt.show()

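The plots show that the label (price) follows a long-tailed distribution, which hampers modelling and prediction. The reason is that many models assume normally distributed error terms, and long-tailed data violate that assumption. Reference blog: https://blog.csdn.net/Noob_daniel/article/details/76087829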
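
A common remedy for a long-tailed label, and the one this post applies further down, is a log(x+1) transform. The sketch below uses made-up prices to show the forward transform and its exact inverse; it is an illustration with the standard library only, not the post's code.

```python
import math

prices = [500, 1500, 3000, 8000, 99999]  # hypothetical long-tailed prices

# Forward transform applied to the label: ln(price + 1).
log_prices = [math.log1p(p) for p in prices]

# Inverse transform: exp(y) - 1. Using exp(y) alone overshoots every price by 1.
restored = [math.expm1(y) for y in log_prices]
overshoot = [math.exp(y) - p for y, p in zip(log_prices, prices)]

print([round(r, 6) for r in restored])
print([round(o, 6) for o in overshoot])
```

When predictions made in log space are mapped back to prices, `expm1` (or `exp(y) - 1`) is the matching inverse of `log(x + 1)`.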

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])
plt.show()

# Apply a log(x+1) transform to the label so that it is closer to a normal distribution
train_y_ln = np.log(train_y + 1)
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
plt.show()

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:18.750748443060488

[('v_9', 8.052410408822315), 
('v_5', 5.764240780403914), 
('v_12', 1.618206098241706), 
('v_1', 1.479831064546508), 
('v_11', 1.166900417358536), 
('v_13', 0.9404706327194452), 
('v_7', 0.7137281645215736), 
('v_3', 0.6837863827349204), 
('v_0', 0.00850050520973589), 
('power_bin', 0.008497968353528977), 
('gearbox', 0.007922378343285602), 
('fuelType', 0.006684768936305926), 
('bodyType', 0.004523520651791603), 
('power', 0.0007161895389359644), 
('brand_price_min', 3.334354528992352e-05), 
('brand_amount', 2.897880289491835e-06), 
('brand_price_median', 1.2571187771074404e-06), 
('brand_price_std', 6.659170007178332e-07), 
('brand_price_max', 6.194957302457314e-07), 
('brand_price_average', 5.999348706659352e-07), 
('SaleID', 2.1194159119234957e-08), 
('seller', 1.6262902136077173e-10), 
('offerType', 1.1036149771825876e-10), 
('train', 6.707523425575346e-12), 
('brand_price_sum', -1.5126514245669237e-10), 
('name', -7.015511195846627e-08), 
('used_time', -4.122477016270915e-06), 
('city', -0.002218783709616053), 
('v_14', -0.004234189820672137), 
('kilometer', -0.013835867353556136), 
('notRepairedDamage', -0.27027942480393996), 
('v_4', -0.8315697362911634), 
('v_2', -0.9470821267759207), 
('v_10', -1.6261468392032863), 
('v_8', -40.34300817115224), 
('v_6', -238.79035497319248)]

# Visualize again: the predictions are now close to the true values and no anomalies appear
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# the label was transformed with log(x+1), so expm1 (exp(x) - 1) is the matching inverse
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

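5-fold cross-validation

When fitting parameters on training data, people usually split the whole data set into three parts (as with the MNIST handwritten digits): a training set (train_set), a validation set (valid_set) and a test set (test_set). This is done deliberately to safeguard training quality. The test set is easy to understand: it takes no part in training and is only used to observe the final performance. The training and validation sets relate to the idea below.

In practice a trained model usually fits the training set quite well (it is sensitive to initial conditions), but often fits data outside the training set less satisfactorily. So we usually do not train on all of the data. Instead we set aside a part that takes no part in training and use it to test the parameters learned from the training set, which gives a relatively objective judgment of how well they fit unseen data. This idea is called cross-validation (Cross Validation)

## 5-fold cross-validation with the linear regression model on the untransformed label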
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.7s finished

print('AVG:', np.mean(scores))

AVG: 1.3658024027748357

# 5-fold cross-validation with the linear regression model on the log-transformed label
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished

print('AVG:', np.mean(scores))

AVG: 0.19325301753940502

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
print(scores)

	cv1	cv2	cv3	cv4	cv5
MAE	0.190792	0.193758	0.194132	0.191825	0.195758

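Simulating the real business scenario

In reality, however, we cannot foresee the future, so 5-fold cross-validation paints an unrealistic picture on some time-dependent data sets. Using 2018 used-car prices to predict 2017 used-car prices is clearly unreasonable, so the data set can also be split in time order. In this example the earliest 4/5 of the samples form the training set and the latest 1/5 form the validation set; the final result differs little from 5-fold cross-validation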
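
The idea of a time-ordered split can be sketched independently of pandas. In the example below the records, the `date` field and the function name `time_ordered_split` are all hypothetical; the point is that half-open slices keep the two parts disjoint and every validation row later than every training row.

```python
def time_ordered_split(records, train_fraction=0.8):
    """Split records sorted by time so that validation rows are strictly later."""
    ordered = sorted(records, key=lambda r: r['date'])
    split_point = int(len(ordered) * train_fraction)
    return ordered[:split_point], ordered[split_point:]  # half-open slices: no shared row


# Hypothetical listings; 'date' is an ISO string, so string order equals time order.
records = [{'date': '2017-%02d-01' % month, 'price': 1000 * month} for month in range(1, 11)]
train, val = time_ordered_split(records)
print(len(train), len(val))
print(train[-1]['date'], '<', val[0]['date'])
```

With label-based slicing that includes both endpoints, the boundary row would land in both parts, so positional slicing is the safer choice for this kind of split.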

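# Split the data set in time order: the earliest 4/5 of the samples for training, the latest 1/5 for validation
import datetime
sample_feature = sample_feature.reset_index(drop=True)
split_point = len(sample_feature) // 5 * 4 # floor division: integer part of the quotient

# .loc slices include both endpoints, which would put row split_point in both sets; .iloc is half-open
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()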

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)
print(mean_absolute_error(val_y_ln, model.predict(val_X)))

0.19577667229471246

Plot the learning curve and the validation curve

# Plot the learning curve and the validation curve
from sklearn.model_selection import learning_curve, validation_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training example')
    plt.ylabel('score')
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
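    plt.grid()  # draw grid lines

    # Arguments of fill_between as used below:
    # x: the x positions to shade, here train_sizes
    # y1, y2: lower and upper edges of the band, here mean - std and mean + std
    # color: color of the shaded region
    # alpha: opacity of the shaded region in [0, 1]; larger means more opaque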
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
plt.show()

Comparing multiple models

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

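Linear models & embedded feature selection

This section assumes the reader already understands overfitting, model complexity, and regularization. If not, look up related material or see the links below:

A plain-language description of overfitting https://www.zhihu.com/question/32246256/answer/55320482
Model complexity and generalization ability http://yangyingming.com/article/434/
An intuitive understanding of regularization https://blog.csdn.net/jinping_shi/article/details/52433975

In filter and wrapper feature selection, the selection step is clearly separate from training the learner. Embedded feature selection instead performs the selection automatically while the learner is trained. The most common embedded methods are L1 and L2 regularization. Adding them to a linear regression model yields Lasso regression and Ridge regression, respectively.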

# Linear models & embedded feature selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

LinearRegression is finished
Ridge is finished
Lasso is finished

Comparing the three methods

# Compare the three methods
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
print(result)

	LinearRegression	Ridge	Lasso
cv1	0.190792	0.194832	0.383899
cv2	0.193758	0.197632	0.381893
cv3	0.194132	0.198123	0.384090
cv4	0.191825	0.195670	0.380526
cv5	0.195758	0.199676	0.383611

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions no longer accept positional data
plt.show()

intercept:18.75072374836874

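L2 regularization generally pushes the weights to be as small as possible during fitting, ending with a model whose parameters are all fairly small. Models with small parameters are usually regarded as simpler, able to adapt to different datasets, and to some degree less prone to overfitting. Picture a linear regression equation: if a parameter is very large, a tiny shift in the data changes the result a lot; if the parameters are small enough, even a larger shift has little effect on the result. The technical way to say this is that the model is robust to perturbation.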
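The shrinking effect can be seen in the simplest possible case. As an illustrative assumption, take a one-feature model y ≈ w·x without intercept; minimizing Σ(y − w·x)² + λ·w² then has the closed form w = Σxy / (Σx² + λ). The function name and the toy data below are hypothetical.

```python
def ridge_weight_1d(xs, ys, lam):
    """Closed-form ridge weight for y ~ w * x (no intercept): w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty lam pulls the weight toward zero without ever making it exactly zero
weights = {lam: ridge_weight_1d(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, w in weights.items():
    print(lam, round(w, 4))
```

With λ = 0 this is ordinary least squares; as λ grows the weight decreases smoothly, which matches the description of L2 producing small but nonzero parameters.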

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions no longer accept positional data
plt.show()

intercept:4.671710763117783

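L1 regularization tends to produce a sparse weight matrix, which can then be used for feature selection. In the plot below, the power and used_time features turn out to be very important.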

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions no longer accept positional data
plt.show()

intercept:8.672182470075398
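Why L1 produces exact zeros can be sketched in the same one-feature setting. As an assumption for illustration, minimizing ½·Σ(y − w·x)² + λ·|w| gives w = S(Σxy, λ) / Σx², where S is the soft-thresholding operator. Each feature is fitted on its own here, which is a simplification of what Lasso does jointly; names and data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_weight_1d(xs, ys, lam):
    """One-feature Lasso weight for y ~ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx


ys = [2.1, 3.9, 6.2, 7.8]
strong_feature = [1.0, 2.0, 3.0, 4.0]    # closely tracks ys
weak_feature = [1.0, -1.0, 1.0, -1.0]    # almost unrelated to ys

strong_w = lasso_weight_1d(strong_feature, ys, lam=5.0)
weak_w = lasso_weight_1d(weak_feature, ys, lam=5.0)
print(strong_w, weak_w)
```

The weakly related feature is set to exactly zero while the informative one keeps a (slightly shrunk) weight, which is the sparsity that makes L1 usable for feature selection.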

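Beyond this, when a decision tree picks split nodes by information entropy or the Gini index, the features chosen earlier for splitting are also the more important ones, which is another form of feature selection. The feature importance scores in XGBoost and LightGBM models are computed on this basis.

Nonlinear models

Datawhale Introduction to Data Mining from Scratch: Task3

Posted on 2020-03-28, updated on 2020-03-30, category: Data Mining and Machine Learning
Word count: 25k, reading time ≈ 23 minutes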

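Feature engineering: analyze the features further and process the data

Common feature engineering steps include

Outlier handling

Remove outliers identified by box plots (or the 3-sigma rule);
Box-Cox transformation (for skewed distributions);
Long-tail truncation;

Feature normalization/standardization:

Standardization (rescale to zero mean and unit variance);
Normalization (rescale to the [0,1] interval);
For power-law distributions, the formula $\log\left(\frac{1+x}{1+\mathrm{median}}\right)$ can be used

Data binning:

Equal-frequency binning;
Equal-width binning;
Best-KS binning (similar to a binary split using the Gini index);
Chi-square binning;

Missing value handling:

Leave as is (for tree models such as XGBoost);
Delete (when too much data is missing);
Imputation, including mean/median/mode/model-based prediction/multiple imputation/compressed-sensing completion/matrix completion;
Binning, with missing values in a bin of their own;

Feature construction:

Statistical features such as count, sum, ratio, and standard deviation;
Time features, including relative and absolute time, holidays, and weekends;
Geographic information, including binning and distribution encoding;
Nonlinear transformations, including log, square, and square root;
Feature combination and feature crossing;
Beyond these, it comes down to individual judgment.

Feature selection

Filter: select features first, then train the learner; common methods are Relief, variance threshold, correlation coefficient, chi-square test, and mutual information;
Wrapper: use the performance of the final learner directly as the evaluation criterion for a feature subset; a common method is LVW (Las Vegas Wrapper);
Embedded: combines filter and wrapper, with feature selection happening automatically while the learner is trained; Lasso regression is a common example;

Dimensionality reduction

PCA/LDA/ICA;
Feature selection is itself a form of dimensionality reduction.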
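The power-law formula from the normalization list, log((1 + x) / (1 + median)), can be tried on a few values. This is a small sketch assuming non-negative inputs; the function name `log_median_scale` and the sample values are hypothetical.

```python
import math
from statistics import median


def log_median_scale(values):
    """Apply log((1 + x) / (1 + median)) to each value; assumes every x >= 0."""
    med = median(values)
    return [math.log((1 + v) / (1 + med)) for v in values]


powers = [0, 60, 75, 110, 150, 19312]   # long right tail, as in a power-law-like feature
scaled = log_median_scale(powers)
print([round(s, 3) for s in scaled])
```

Values equal to the median map to 0, smaller values become negative, and the extreme tail is compressed to a moderate positive number instead of dominating the scale.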

Importing the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

Train_data = pd.read_csv("./datalab/used_car_train_20200313.csv", sep=" ")
Test_data = pd.read_csv("./datalab/used_car_testA_20200313.csv", sep=" ")

print(Train_data.shape)

(150000, 31)

print(Train_data.head())

	SaleID	name	regDate	model	…	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	…	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	…	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	…	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	…	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	…	0.910783	0.931110	2.834518	1.923482

[5 rows x 31 columns]

print(Train_data.columns)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')

Removing outliers

## Remove outliers
# A reusable wrapper for outlier handling
def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers; by default uses the box-plot rule with scale=3.
    :param data: a pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiple of the interquartile range (IQR) used as the margin
    """
    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule.
        :param data_ser: a pandas.Series
        :param box_scale: box-plot scale; values above Q3 + box_scale * IQR or below Q1 - box_scale * IQR count as outliers (the conventional box plot uses 1.5)
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25)) # box_scale times the IQR
        val_low = data_ser.quantile(0.25) - iqr # lower bound = Q1 - box_scale * IQR
        val_up = data_ser.quantile(0.75) + iqr # upper bound = Q3 + box_scale * IQR
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up) # boolean pandas.Series marking the flagged values
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy() # work on a copy
    data_series = data_n[col_name] # the column to check
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    index = np.arange(data_series.shape[0])[rule[0]|rule[1]] # positions where rule_low or rule_up is True (assumes a default 0..n-1 index)
    print("Delete number is:{}".format(len(index))) # number of rows to delete
    data_n = data_n.drop(index) # drop the rows (the remaining labels are unchanged)
    data_n.reset_index(drop=True, inplace=True) # renumber the index (drop=True discards the old index; inplace=True modifies data_n itself)
    print("Now column number is:{}".format(data_n.shape[0])) # number of rows left after deletion
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low] # iloc: index by position
    print("Description of data larger than the lower bound is:") # these are the values below the lower bound
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())

    fig, ax = plt.subplots(1, 2, figsize=(10, 7)) # subplots: 1 row, 2 columns
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0]) # box plot before cleaning
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1]) # box plot after cleaning
    plt.show()
    return data_n


## We can delete some abnormal data, using power as the example
## Whether to delete is a judgment call
## But note that rows of the Test data must not be deleted
Train_data = outliers_proc(Train_data, "power", scale=3)

Delete number is:963
Now column number is:149037
Description of data larger than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count      963.000000
mean       846.836968
std       1929.418081
min        376.000000
25%        400.000000
50%        436.000000
75%        514.000000
max      19312.000000
Name: power, dtype: float64
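The box-plot rule used by `outliers_proc` can be checked by hand on a short list. The sketch below uses only the standard library; `iqr_bounds`, `drop_outliers`, and the sample values are hypothetical, and `method="inclusive"` is chosen on the assumption that it matches the linear interpolation pandas uses by default for `quantile`.

```python
from statistics import quantiles


def iqr_bounds(values, scale=3.0):
    """Return (lower, upper) = (Q1 - scale * IQR, Q3 + scale * IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    margin = scale * (q3 - q1)
    return q1 - margin, q3 + margin


def drop_outliers(values, scale=3.0):
    """Keep only the values that lie inside the IQR bounds."""
    low, high = iqr_bounds(values, scale)
    return [v for v in values if low <= v <= high]


powers = [60, 75, 90, 105, 110, 125, 150, 163, 193, 19312]
low, high = iqr_bounds(powers)
kept = drop_outliers(powers)
print(low, high, kept)
```

Here Q1 = 93.75 and Q3 = 159.75, so with scale 3 the bounds are -104.25 and 357.75, and only the value 19312 is removed.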

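Feature construction

# Put the training and test sets together to make feature construction easier
Train_data["train"] = 1 # add a flag column: 1 marks training rows
Test_data["train"] = 0 # 0 marks test rows, so the two sets can be separated again later
data = pd.concat([Train_data,Test_data],ignore_index=True) # concatenate; ignore_index=True renumbers the index

# Usage time: data["creatDate"] - data["regDate"] reflects how long the car has been in use; in general, the longer the use, the lower the price
# Note that some dates in the data are malformed, so we need errors="coerce"
data["used_time"] = (pd.to_datetime(data["creatDate"], format="%Y%m%d", errors="coerce") -
pd.to_datetime(data["regDate"],format="%Y%m%d",errors="coerce")).dt.days # to_datetime parses the dates; dt.days gives the difference in days

# Check the nulls: about 15k samples have a problematic date; we could delete them or leave them
# Deleting is not recommended here because the missing rows are too large a share of the data, 7.5%
# We can leave them for now: tree models such as XGBoost handle missing values natively
print(data["used_time"].isnull().sum())

# Extract city information from the postal code, which amounts to adding prior knowledge
#print(data["regionCode"])
# Add a city field by dropping the last three characters of regionCode (apply runs the lambda on every element)
data["city"] = data["regionCode"].apply(lambda x : str(x)[:-3])

# Compute sales statistics per brand; statistics for other features can be computed the same way
# The statistics are computed from the train data only
Train_gb = Train_data.groupby("brand") # group by brand
all_info = {}
for kind, kind_data in Train_gb:
    info = {}
    kind_data = kind_data[kind_data["price"] > 0]
    info["brand_amount"] = len(kind_data)
    info["brand_price_max"] = kind_data.price.max()
    info["brand_price_median"] = kind_data.price.median()
    info["brand_price_min"] = kind_data.price.min()
    info["brand_price_sum"] = kind_data.price.sum()
    info["brand_price_std"] = kind_data.price.std() # sample standard deviation
    info["brand_price_average"] = round(kind_data.price.sum() / (len(kind_data)+1), 2) # mean with +1 in the denominator, rounded to two decimals
    all_info[kind] = info

brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index":"brand"}) # T transposes; reset_index turns the index into a column; rename calls it "brand"
data = data.merge(brand_fe, how="left", on="brand") # left join: keep every row of data and attach the brand statistics by "brand"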
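What the groupby plus left merge achieves can be sketched with plain dictionaries. The rows, the reduced set of statistics, and the use of `None` for brands unseen in training are illustrative assumptions, not the pandas behavior itself (pandas would show NaN).

```python
from collections import defaultdict
from statistics import median

train_rows = [
    {"brand": 0, "price": 1000},
    {"brand": 0, "price": 3000},
    {"brand": 1, "price": 500},
    {"brand": 1, "price": 0},      # non-positive prices are ignored, as in the code above
]
test_rows = [{"brand": 0}, {"brand": 2}]   # brand 2 never appears in training

# Group training prices by brand
prices_by_brand = defaultdict(list)
for row in train_rows:
    if row["price"] > 0:
        prices_by_brand[row["brand"]].append(row["price"])

# One statistics record per brand
brand_stats = {
    brand: {
        "brand_amount": len(prices),
        "brand_price_max": max(prices),
        "brand_price_median": median(prices),
    }
    for brand, prices in prices_by_brand.items()
}

# Left join: every row is kept; unknown brands get empty statistics
empty = {"brand_amount": None, "brand_price_max": None, "brand_price_median": None}
enriched = [{**row, **brand_stats.get(row["brand"], empty)} for row in train_rows + test_rows]
print(enriched[-2], enriched[-1])
```

Computing the statistics from training rows only, then attaching them to both sets, keeps test prices (which are unknown anyway) out of the features.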

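Data binning

# Data binning, using power as the example
# Missing values are bucketed at this point too
# Why bin:
# 1. After discretization, sparse-vector inner products are faster to compute, and the results are easy to store and extend
# 2. Discretized features are more robust to outliers: with age>30 mapped to 1 and otherwise 0, an age of 200 no longer disturbs the model much
# 3. LR is a generalized linear model with limited expressive power; after discretization each bin gets its own weight, which introduces nonlinearity and improves the capacity to fit
# 4. Discretized features can be crossed, turning M+N variables into M*N variables, which introduces further nonlinearity and expressive power
# 5. Discretized features make the model more stable: with age ranges, a user growing one year older does not change the feature
# There are many other reasons; LightGBM, for example, added data binning when improving on XGBoost, which improved generalization

bins = [i*10 for i in range(31)] # bin edges 0, 10, ..., 300 (named bins to avoid shadowing the built-in bin)
data["power_bin"] = pd.cut(data["power"], bins, labels=False) # cut splits one-dimensional data at the given edges; labels=False returns the bin number (starting at 0)
print(data[["power_bin", "power"]].head())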
print(data[["power_bin", "power"]].head())

 power_bin  power
0        5.0     60
1        NaN      0
2       16.0    163
3       19.0    193
4        6.0     68
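The output above shows power 0 mapped to NaN. With its default settings `pd.cut` builds right-closed intervals (0, 10], (10, 20], and so on, so 0 falls outside the first interval. The standard-library sketch below mimics that rule; `cut_right_closed` is a hypothetical helper, and `None` stands in for NaN.

```python
from bisect import bisect_left


def cut_right_closed(value, edges):
    """Return the index of the right-closed interval (edges[i], edges[i+1]] containing value, else None."""
    if value <= edges[0] or value > edges[-1]:
        return None
    return bisect_left(edges, value) - 1


edges = [i * 10 for i in range(31)]          # 0, 10, ..., 300
sample_powers = [60, 0, 163, 193, 68]        # the five values printed above
power_bins = [cut_right_closed(p, edges) for p in sample_powers]
print(power_bins)
```

The result reproduces the five rows shown above, and values above 300 are left without a bin in the same way.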

# Drop the columns that are no longer needed
data = data.drop(["creatDate", "regDate", "regionCode"], axis=1) # drop removes rows by default; axis=1 is needed to remove columns
print(data.shape)

(199037, 39)

print(data.columns)

Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
       'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
       'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
       'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time',
       'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_sum', 'brand_price_std',
       'brand_price_average', 'power_bin'],
      dtype='object')

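Exporting the data

# The data can already be used by tree models, so we export it
data.to_csv("data_for_tree.csv", index=0) # index=0: do not save the row index

Feature construction

# We can build another set of features for models such as LR and NN
# They are built separately because different models have different data requirements
# First look at the distribution:
data["power"].plot.hist()
plt.show()

# We already handled outliers in train, yet the distribution still looks odd because of the power outliers in test,
# so it would have been better not to delete the power outliers in train and to use long-tail truncation instead
Train_data["power"].plot.hist()
plt.show()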

Normalization

# Take the log first, then normalize
from sklearn import preprocessing
# MinMaxScaler rescales each feature to a given range: subtract the minimum, then divide by the range (max - min)
min_max_scaler = preprocessing.MinMaxScaler()
data["power"] = np.log(data["power"] + 1)
# Min-max normalization to [0, 1]
data["power"] = ((data["power"] - np.min(data["power"])) / (np.max(data["power"]) - np.min(data["power"])))
data["power"].plot.hist()
plt.show()
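The two steps, log(x + 1) followed by min-max scaling, can be followed on a handful of numbers. The helper `min_max` and the sample values are hypothetical; the guard for a constant column is an addition that the code above does not have.

```python
import math


def min_max(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros (added guard)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


powers = [0, 60, 163, 193, 600]
logged = [math.log(p + 1) for p in powers]   # compress the long tail first
scaled = min_max(logged)
print([round(s, 3) for s in scaled])
```

Taking the log before scaling keeps a single very large value from squeezing all the other values into a narrow band near zero.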

# kilometer looks fairly normal; it has probably been binned already
data["kilometer"].plot.hist()
plt.show()

# So it can be normalized directly
data["kilometer"] = ((data["kilometer"] - np.min(data["kilometer"])) /
                     (np.max(data["kilometer"]) - np.min(data["kilometer"])))
data["kilometer"].plot.hist()
plt.show()

# We also have the statistical features built earlier:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# We skip the one-by-one analysis here and transform them directly
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Apply the same min-max scaling to every brand statistic
for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])



# One-hot encode the categorical features

data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin']) # convert to dummy variables

print(data.shape)

(199037, 370)

print(data.columns)

Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'price',
       'v_0', 'v_1', 'v_2',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=370)
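One-hot encoding replaces a categorical column with one 0/1 column per category, which is why the column count jumps to 370 above. The sketch below shows the idea on a single column; `one_hot` and the rows are hypothetical, and missing values simply get 0 in every new column, which is also what `pd.get_dummies` does by default.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per observed category."""
    categories = sorted({row[column] for row in rows if row[column] is not None})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"gearbox": 0.0, "power": 60},
    {"gearbox": 1.0, "power": 163},
    {"gearbox": None, "power": 68},   # missing category
]
encoded = one_hot(rows, "gearbox")
for row in encoded:
    print(row)
```

Each category gets its own weight in a linear model this way, instead of being treated as an ordered number.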

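Exporting the data

# This dataset can be used by LR
data.to_csv("data_for_lr.csv", index=0)

Feature selection

Filter

# 1) Filter
# Correlation analysis
print(data['power'].corr(data['price'], method='spearman')) # Spearman: rank correlation, suited to nonlinear relationships and non-normally distributed data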
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))

0.5728285196051496
-0.4082569701616764
0.058156610025581514
0.3834909576057687
0.259066833880992
0.38691042393409447
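Spearman correlation is the Pearson correlation of the ranks, which is why it captures any monotonic relationship rather than only linear ones. The following standard-library sketch shows the computation; the function names and the toy data are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


kilometer = [1, 3, 5, 10, 15]
price = [50000, 30000, 29000, 8000, 500]   # falls as kilometer rises, but not linearly
rho = spearman(kilometer, price)
print(round(rho, 6))
```

The relationship here is perfectly monotonic, so the coefficient is -1 up to floating-point error even though the prices do not fall along a straight line.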

# Or simply look at a plot
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average',
                     'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr() # correlation matrix of data_numeric

f, ax = plt.subplots(figsize=(7,7))
plt.title("Correlation of Numeric Features with Price", y=1, size=16)
# square=True sets the axes aspect to "equal" so every cell is square; vmax is the upper limit of the color map
sns.heatmap(correlation, square=True, vmax=0.8)
plt.show()

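Wrapper

The code below raises an error when run, and I have not worked out why

# 2) Wrapper
from mlxtend.feature_selection import SequentialFeatureSelector as SFS # sequential feature selection, a greedy search algorithm
from sklearn.linear_model import  LinearRegression # ordinary least squares linear regression
sfs = SFS(LinearRegression(), # the classifier or regressor
          k_features=10, # number of features to select
          forward=True, # True for forward selection, otherwise backward selection
          floating=False, # True adds the conditional exclusion/inclusion step
          scoring="r2", # use "r2" for sklearn regressors
          cv=0) # no cross validation if cv is None, False, or 0
x = data.drop(["price"], axis=1)
x = x.fillna(0)
y = data["price"]

sfs.fit(x, y) # run the feature selection and fit on the training data; x: samples, y: target
sfs.k_feature_names_


# Plot it to see the marginal gain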
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()

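Embedded

# Covered in the next chapter: Lasso regression and decision trees can perform embedded feature selection
# In most cases feature selection is done with the embedded approach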

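Lessons learned

Feature engineering is the most crucial part of a competition. Especially in traditional competitions, everyone's models tend to be much the same and the gains from hyperparameter tuning are very limited, whereas the quality of the feature engineering often decides the final ranking and score.

The main goal of feature engineering is to transform the data into features that better represent the underlying problem, and so improve machine learning performance. For example, outlier handling removes noise, and filling in missing values can inject prior knowledge.

Feature construction is also part of feature engineering; its purpose is to strengthen the expressiveness of the data.

In some competitions the features are anonymous, so we do not know how they relate to each other. In that case we can only process the features as they are, for example with binning, groupby, and agg operations to compute feature statistics. We can also apply further transformations such as log and exp, or combine several features arithmetically (like the usage time computed above) or polynomially, and then select among the results. Anonymity does limit what can be done with the features, though sometimes using an NN to extract features works surprisingly well.

When the meaning of the features is known (non-anonymous), especially in industrial competitions, more meaningful features are built from signal processing, frequency-domain extraction, kurtosis, skewness, and so on; this is feature construction informed by domain background. The same holds in recommender systems, with click-through-rate statistics of various kinds, statistics per time period, statistics combined with user attributes, and so on. Building such features usually requires a deep analysis of the underlying business logic or physical principles in order to find the magic features.

Feature engineering is of course tied to the model, which is why we bin and normalize features for LR and NN; the effect of feature processing and the importance of features usually have to be verified through the model.

Overall, feature engineering is easy to get started with but very hard to master.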

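Datawhale Introduction to Data Mining from Scratch: Task2

Posted on 2020-03-24, updated on 2020-03-26, category: Data Mining and Machine Learning
Word count: 33k, reading time ≈ 30 minutes

The value of EDA lies mainly in getting familiar with the dataset, understanding it, and validating it to confirm that it can be used for the machine learning or deep learning that follows
Once we understand the dataset, the next step is to understand the relationships among the variables and between the variables and the target
Then carry out data processing and feature engineering so that the structure of the dataset and the feature set make the subsequent prediction more reliable

Loading the data science and visualization libraries

# Import the warnings package and use a filter to silence warning messages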
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

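Loading the data

## pd.set_option('display.max_columns', None) # show all columns
## pd.set_option('display.max_rows', None) # show all rows
## 1) Load the training and test sets
Train_data = pd.read_csv("./datalab/used_car_train_20200313.csv", sep = " ")
Test_data = pd.read_csv("./datalab/used_car_testA_20200313.csv", sep = " ")

The following mainly uses Train_data as the example

A quick look at the data

## 2) A quick look at the data (head() + shape)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to stack head() and tail()
print(pd.concat([Train_data.head(), Train_data.tail()]))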

	SaleID	name	regDate	model	…	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	…	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	…	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	…	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	…	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	…	0.910783	0.931110	2.834518	1.923482
149995	149995	163978	20000607	121.0	…	-2.983973	0.589167	-1.304370	-0.302592
149996	149996	184535	20091102	116.0	…	-2.774615	2.553994	0.924196	-0.272160
149997	149997	147587	20101003	60.0	…	-1.630677	2.290197	1.891922	0.414931
149998	149998	45907	20060312	34.0	…	-2.633719	1.414937	0.431981	-1.659014
149999	149999	177672	19990204	19.0	…	-3.179913	0.031724	-1.483350	-0.342674

[10 rows x 31 columns]

print(Train_data.shape)

(150000, 31)

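Overview of the data

describe() gives per-column statistics: count, mean, standard deviation (std), minimum, the 25%/50%/75% quantiles, and maximum. It gives a quick sense of the rough range of the data and helps spot abnormal values in each column; for instance values such as 999, 9999, or -1 sometimes show up, and these are often just another way of writing nan, so watch out for them
info() shows the type of each column, which helps reveal special symbols other than nan that mark anomalies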
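Placeholder values that stand for "missing" have to be mapped to a real missing marker before counting gaps. The sketch below is illustrative only: the set of sentinels is an assumption that must be checked against each dataset, and `None` stands in for nan.

```python
# Assumed sentinel values; which values really mean "missing" depends on the dataset
SENTINELS = {999, 9999, -1, "-"}


def clean(value):
    """Map a sentinel to None and leave every other value unchanged."""
    return None if value in SENTINELS else value


column = [0.0, "-", 1.0, 9999, 3.5]
cleaned = [clean(v) for v in column]
missing_count = sum(v is None for v in cleaned)
print(cleaned, missing_count)
```

Only after this step does a missing-value count reflect the placeholders as well as the true nan entries.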

Using describe() to get familiar with the statistics

## 3) Use describe() to get familiar with the statistics
print(Train_data.describe())

	SaleID	name	…	v_13	v_14
count	150000.000000	150000.000000	…	150000.000000	150000.000000
mean	74999.500000	68349.172873	…	0.000313	-0.000688
std	43301.414527	61103.875095	…	1.288988	1.038685
min	0.000000	0.000000	…	-4.153899	-6.546556
25%	37499.750000	11156.000000	…	-1.057789	-0.437034
50%	74999.500000	51638.000000	…	-0.036245	0.141246
75%	112499.250000	118841.250000	…	0.942813	0.680378
max	149999.000000	196812.000000	…	11.147669	8.658418

[8 rows x 30 columns]

Using info() to get familiar with the data types

## 4) Use info() to get familiar with the data types
print(Train_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
None

Checking for missing and abnormal data

Checking each column for nan

## 5) Check each column for nan
print(Train_data.isnull().sum())

SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

Visualizing NaN counts

# Visualize NaN counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True) # sort ascending
missing.plot.bar() # draw a bar chart
plt.tight_layout() # adjust subplot parameters automatically
plt.show()

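The plot above makes it easy to see which columns contain NaN values, and the counts can be printed as well. The main question is whether the number of NaN values is really large. If it is small, filling is usually the choice. With tree models such as LightGBM (lgb) the gaps can be left as they are and the trees will handle them. If a column has too many NaN values, consider dropping it.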
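The rule of thumb above can be sketched as a small report. This is only an illustration under assumptions: the helper name `missing_report`, the 50% cut-off and the toy DataFrame are hypothetical and do not come from the competition data.

```python
import numpy as np
import pandas as pd


def missing_report(df, drop_threshold=0.5):
    """Per-column missing count and ratio, with a rough suggested action."""
    counts = df.isnull().sum()
    ratios = counts / len(df)
    report = pd.DataFrame({"missing": counts, "ratio": ratios})
    # Keep only columns that actually have missing values, smallest ratio first
    report = report[report["missing"] > 0].sort_values("ratio")
    # The threshold is an assumption; pick it per project
    report["suggestion"] = np.where(
        report["ratio"] > drop_threshold,
        "consider dropping",
        "fill or leave to the model",
    )
    return report


# Toy data (hypothetical): column "b" is mostly missing, "c" is complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1, 2, 3, 4],
})
report = missing_report(toy)
print(report)
```

On the real data the same call would be `missing_report(Train_data)`. The cut-off is a judgment call rather than a fixed rule.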

Visualize the missing values

## Visualize the missing values
msno.matrix(Train_data.sample(250))
plt.show()

msno.bar(Train_data.sample(1000)) # bar chart
plt.show()

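Checking for abnormal values

The earlier info() output shows that every column is numeric except notRepairedDamage, which has dtype object. Listing its distinct values shows why:

print(Train_data["notRepairedDamage"].value_counts()) # distinct values and their counts

0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

So "-" also marks a missing value. Because many models handle NaN directly, we do no further processing here and simply replace it with NaN first.

# Replace "-" with NaN (assign the result back instead of calling replace in place on a column)
Train_data["notRepairedDamage"] = Train_data["notRepairedDamage"].replace("-", np.nan)
print(Train_data["notRepairedDamage"].value_counts())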

0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

Check the NaN counts again

print(Train_data.isnull().sum())

SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64

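The following two categorical features are severely skewed and generally contribute nothing to prediction, so we drop them here. You can dig into them further, but it is rarely worthwhile.

print(Train_data["seller"].value_counts())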

0    149999
1         1
Name: seller, dtype: int64

print(Train_data["offerType"].value_counts())

0    150000
Name: offerType, dtype: int64

# Drop the severely skewed columns
del Train_data["seller"]
del Train_data["offerType"]
print(Train_data.info())
print(Train_data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 29 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  creatDate          150000 non-null  int64  
 13  price              150000 non-null  int64  
 14  v_0                150000 non-null  float64
 15  v_1                150000 non-null  float64
 16  v_2                150000 non-null  float64
 17  v_3                150000 non-null  float64
 18  v_4                150000 non-null  float64
 19  v_5                150000 non-null  float64
 20  v_6                150000 non-null  float64
 21  v_7                150000 non-null  float64
 22  v_8                150000 non-null  float64
 23  v_9                150000 non-null  float64
 24  v_10               150000 non-null  float64
 25  v_11               150000 non-null  float64
 26  v_12               150000 non-null  float64
 27  v_13               150000 non-null  float64
 28  v_14               150000 non-null  float64
dtypes: float64(20), int64(8), object(1)
memory usage: 33.2+ MB
None

(150000, 29)
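Columns such as seller and offerType were spotted above by reading value_counts() by hand. A minimal sketch of automating that check follows. The helper names, the 0.99 threshold and the toy DataFrame are assumptions for illustration only.

```python
import pandas as pd


def dominant_share(series):
    """Fraction of rows taken by the single most frequent value (NaN counted)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]


def near_constant_columns(df, threshold=0.99):
    """Columns in which one value covers at least `threshold` of the rows."""
    return [col for col in df.columns if dominant_share(df[col]) >= threshold]


# Toy data (hypothetical) imitating the shape of the problem
toy = pd.DataFrame({
    "seller": [0] * 999 + [1],   # one value dominates almost completely
    "offerType": [0] * 1000,     # constant column
    "gearbox": [0, 1] * 500,     # balanced column, should be kept
})
skewed = near_constant_columns(toy)
print(skewed)
```

Whether such columns are dropped is still a modeling decision. The sketch only lists the candidates.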

Understanding the distribution of the target

print(Train_data["price"])
print(Train_data["price"].value_counts())

0         1850
1         3600
2         6222
3         2400
4         5200
      ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64

500      2337
1500     2158
1200     1922
1000     1850
2500     1821
     ... 
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64

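Overall distribution (unbounded Johnson distribution and others)

## 1) Overall distribution (unbounded Johnson distribution and others)
import scipy.stats as st
y = Train_data["price"]
plt.figure(1); plt.title("Johnson SU") # create a new figure
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title("Normal")
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title("Log Normal")
sns.distplot(y, kde=False, fit=st.lognorm)
plt.show() # the best fit is the unbounded Johnson distribution

Price is not normally distributed, so it must be transformed before regression. The log transform does well, but the best fit is the unbounded Johnson distribution (Johnson SU).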

Check skewness and kurtosis

## 2) Check skewness and kurtosis
sns.distplot(Train_data["price"])
print("Skewness: %f" % Train_data["price"].skew()) # skewness
print("Kurtosis: %f" % Train_data["price"].kurt()) # kurtosis
plt.show()

Skewness: 3.346487
Kurtosis: 18.995183

print(Train_data.skew())
print(Train_data.kurt())

SaleID               6.017846e-17
name                 5.576058e-01
regDate              2.849508e-02
model                1.484388e+00
brand                1.150760e+00
bodyType             9.915299e-01
fuelType             1.595486e+00
gearbox              1.317514e+00
power                6.586318e+01
kilometer           -1.525921e+00
notRepairedDamage    2.430640e+00
regionCode           6.888812e-01
creatDate           -7.901331e+01
price                3.346487e+00
v_0                 -1.316712e+00
v_1                  3.594543e-01
v_2                  4.842556e+00
v_3                  1.062920e-01
v_4                  3.679890e-01
v_5                 -4.737094e+00
v_6                  3.680730e-01
v_7                  5.130233e+00
v_8                  2.046133e-01
v_9                  4.195007e-01
v_10                 2.522046e-02
v_11                 3.029146e+00
v_12                 3.653576e-01
v_13                 2.679152e-01
v_14                -1.186355e+00
dtype: float64

SaleID                 -1.200000
name                   -1.039945
regDate                -0.697308
model                   1.740483
brand                   1.076201
bodyType                0.206937
fuelType                5.880049
gearbox                -0.264161
power                5733.451054
kilometer               1.141934
notRepairedDamage       3.908072
regionCode             -0.340832
creatDate            6881.080328
price                  18.995183
v_0                     3.993841
v_1                    -1.753017
v_2                    23.860591
v_3                    -0.418006
v_4                    -0.197295
v_5                    22.934081
v_6                    -1.742567
v_7                    25.845489
v_8                    -0.636225
v_9                    -0.321491
v_10                   -0.577935
v_11                   12.568731
v_12                    0.268937
v_13                   -0.438274
v_14                    2.393526
dtype: float64

sns.distplot(Train_data.skew(), color="blue", axlabel="Skewness")
plt.show()

sns.distplot(Train_data.kurt(), color="orange", axlabel="Kurtosis")
plt.show()

For an explanation of skew and kurt, see https://www.cnblogs.com/wyy1480/p/10474046.html

Check the frequency of the target values

# 3) Check the frequency of the target values
plt.hist(Train_data["price"], orientation="vertical", histtype="bar", color="red")
plt.show() # histogram

The histogram shows that prices above 20000 are very rare. These could be treated as special values (outliers) and filled or dropped in an earlier step.
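One common way to flag such rare high values is an interquartile-range rule. The sketch below is a hedged illustration, not the method used in this post: the function name `iqr_bounds`, the factor `k=3.0` and the toy price array are all assumptions.

```python
import numpy as np


def iqr_bounds(values, k=3.0):
    """Lower and upper fences at k times the interquartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices (hypothetical); the last one is far above the rest
prices = np.array([500, 1200, 1500, 2500, 3600, 5200, 6222, 9500, 99999], dtype=float)

low, high = iqr_bounds(prices)
is_outlier = (prices < low) | (prices > high)   # boolean mask of flagged values
clipped = np.clip(prices, low, high)            # alternative to dropping: cap the values

print("fences:", low, high)
print("flagged:", prices[is_outlier])
```

Clipping keeps the row while limiting its influence, whereas dropping removes it entirely. Which is better depends on the model and on whether the extreme prices are real.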

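# After the log transform the distribution is fairly even, so the target can be log-transformed before prediction; this is a common trick in prediction problems
plt.hist(np.log(Train_data["price"]), orientation="vertical", histtype="bar", color="red")
plt.show()

After the log transform the distribution is fairly even, so the target can be log-transformed before prediction. This is a common trick in prediction problems.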
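The effect of the log trick can be sketched on synthetic data. The assumptions here are: the prices are drawn from a log-normal distribution rather than taken from the competition data, the `skewness` helper is written by hand, and `np.log1p`/`np.expm1` are used instead of `np.log` so that a zero price stays finite and predictions can be mapped back to the price scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (assumption: log-normal), not the real data
price = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)


def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


log_price = np.log1p(price)     # log(1 + x): train the model on this target
restored = np.expm1(log_price)  # inverse transform: back to the price scale

print("skewness before:", skewness(price))
print("skewness after: ", skewness(log_price))
```

A model trained on `log_price` predicts in log space, so its outputs need `np.expm1` before an error such as MAE is computed on the original price scale.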

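Split the features into categorical and numeric features, and look at the nunique distribution of the categorical features

Data fields

name - car code
regDate - car registration date
model - model code
brand - brand
bodyType - body type
fuelType - fuel type
gearbox - gearbox
power - engine power
kilometer - kilometers driven
notRepairedDamage - the car has damage that has not been repaired
regionCode - code of the region where the car is viewed
seller - seller [dropped]
offerType - offer type [dropped]
creatDate - date the ad was posted
price - car price
'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' [anonymous features, 15 in total, v_0 through v_14]

# Separate the label, i.e. the value to predict
Y_train = Train_data['price']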

# Splitting by dtype works for data that has not already been label-encoded.
# It does not apply here, so the features are split by hand according to their meaning.
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=["object"])
# categorical_features.columns

# Numeric features
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
# Categorical features
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']

## nunique distribution of the categorical features in Train_data
for cat_fea in categorical_features:
    print(cat_fea+"的特征分布如下：") # "distribution of this feature:"
    print("{}特征有{}个不同的值".format(cat_fea, Train_data[cat_fea].nunique())) # "feature X has N distinct values"
    print(Train_data[cat_fea].value_counts())

name的特征分布如下：
name特征有99662个不同的值
708       282
387       282
55        280
1541      263
203       233
     ... 
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64

model的特征分布如下：
model特征有248个不同的值
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
     ...  
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64

brand的特征分布如下：
brand特征有40个不同的值
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64

bodyType的特征分布如下：
bodyType特征有8个不同的值
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64

fuelType的特征分布如下：
fuelType特征有7个不同的值
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64

gearbox的特征分布如下：
gearbox特征有2个不同的值
0.0    111623
1.0     32396
Name: gearbox, dtype: int64

notRepairedDamage的特征分布如下：
notRepairedDamage特征有2个不同的值
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

regionCode的特征分布如下：
regionCode特征有7905个不同的值
419     369
764     258
125     137
176     136
462     134
       ... 
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64

Numeric feature analysis

numeric_features.append("price")
print(numeric_features)

['power', 
'kilometer', 
'v_0', 
'v_1', 
'v_2', 
'v_3', 
'v_4', 
'v_5', 
'v_6', 
'v_7', 
'v_8', 
'v_9', 
'v_10', 
'v_11', 
'v_12', 
'v_13', 
'v_14', 
'price']

print(Train_data.head())

	SaleID	name	regDate	model	…	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	…	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	…	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	…	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	…	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	…	0.910783	0.931110	2.834518	1.923482

[5 rows x 29 columns]

Check the skewness and kurtosis of the numeric features

## 2) Check the skewness and kurtosis of the numeric features
for col in numeric_features:
    print("{:15}".format(col),"Skewness:{:05.2f}".format(Train_data[col].skew()),
    "   ",
          "Kurtosis:{:06.2f}".format(Train_data[col].kurt()))

power           Skewness:65.86     Kurtosis:5733.45
kilometer       Skewness:-1.53     Kurtosis:001.14
v_0             Skewness:-1.32     Kurtosis:003.99
v_1             Skewness:00.36     Kurtosis:-01.75
v_2             Skewness:04.84     Kurtosis:023.86
v_3             Skewness:00.11     Kurtosis:-00.42
v_4             Skewness:00.37     Kurtosis:-00.20
v_5             Skewness:-4.74     Kurtosis:022.93
v_6             Skewness:00.37     Kurtosis:-01.74
v_7             Skewness:05.13     Kurtosis:025.85
v_8             Skewness:00.20     Kurtosis:-00.64
v_9             Skewness:00.42     Kurtosis:-00.32
v_10            Skewness:00.03     Kurtosis:-00.58
v_11            Skewness:03.03     Kurtosis:012.57
v_12            Skewness:00.37     Kurtosis:000.27
v_13            Skewness:00.27     Kurtosis:-00.44
v_14            Skewness:-1.19     Kurtosis:002.39
price           Skewness:03.35     Kurtosis:019.00

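Visualize the distribution of each numeric feature

## 3) Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features) # reshape to long format
g = sns.FacetGrid(f,col="variable", col_wrap=2, sharex=False,sharey=False) # one panel per "variable"
# plt.show()
g = g.map(sns.distplot, "value") # draw "value" in each panel
plt.show()

The plots show that the anonymous features are distributed relatively evenly.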

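Visualize the relationships between numeric features

## 4) Visualize the relationships between numeric features
sns.set() # set the plotting style
colunms = ["price", "v_12", "v_8", "v_0", "power", "v_5", "v_2", "v_6", "v_1", "v_14"]
sns.pairplot(Train_data[colunms],height=2, kind="scatter", diag_kind="kde") # pairwise plots; recent seaborn uses height instead of size
plt.show()

Visualize regression relationships between multiple variables

This section visualizes the relationships between multiple variables. For more on visualization, this article is a good reference: https://www.jianshu.com/p/6e18d21a4cad

print(Train_data.columns)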

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')

print(Y_train)

0         1850
1         3600
2         6222
3         2400
4         5200
      ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64

## 5) Visualize regression relationships between the target and several variables
fig,((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20)) # ten subplots in 5 rows and 2 columns
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data["v_12"]], axis=1) # put the two columns side by side
#print(v_12_scatter_plot)
sns.regplot(x="v_12", y="price", data=v_12_scatter_plot,scatter=True,fit_reg=True,ax=ax1) # scatter plot with a fitted regression line

v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
plt.show()

Categorical feature analysis

nunique distribution

## 1) nunique distribution
for fea in categorical_features:
    print(Train_data[fea].nunique())

print(categorical_features)

['name', 
'model', 
'brand', 
'bodyType', 
'fuelType', 
'gearbox', 
'notRepairedDamage', 
'regionCode']

Box plots of the categorical features

## 2) Box plots of the categorical features
# name and regionCode have too many sparse categories, so only the less sparse features are plotted
categorical_features = ["model",
                        "brand",
                        "bodyType",
                        "fuelType",
                        "gearbox",
                        "notRepairedDamage"]
for c in categorical_features:
    Train_data[c] = Train_data[c].astype("category") # cast to the category dtype
    if Train_data[c].isnull().any(): # check whether the column has missing values
        Train_data[c] = Train_data[c].cat.add_categories(["MISSING"]) # add a new category
        Train_data[c] = Train_data[c].fillna("MISSING") # fill the NaN values with it
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y) # box plot
    x=plt.xticks(rotation=90) # rotate the x tick labels

f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
g = sns.FacetGrid(f,col="variable", col_wrap=2, sharex=False,sharey=False,height=5) # recent seaborn uses height instead of size
g = g.map(boxplot, "value", "price")
plt.show()

Violin plots of the categorical features

print(Train_data.columns)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')

## 3) Violin plots of the categorical features
catg_list = categorical_features
target = "price"
for catg in catg_list:
    sns.violinplot(x=catg,y=target,data=Train_data)
    plt.show()

Bar plots of the categorical features

print(categorical_features)

['model', 
'brand', 
'bodyType', 
'fuelType', 
'gearbox', 
'notRepairedDamage']

## 4) Bar plots of the categorical features
def bar_plot(x,y,**kwargs): # bar plot
    sns.barplot(x=x,y=y)
    x=plt.xticks(rotation=90)
f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",col_wrap=2,sharex=False,sharey=False,height=5) # recent seaborn uses height instead of size
g = g.map(bar_plot, "value", "price")
plt.show()

Frequency of each category of the categorical features

## 5) Frequency of each category of the categorical features
def count_plot(x,**kwargs): # count plot
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)
f = pd.melt(Train_data,value_vars=categorical_features)
g = sns.FacetGrid(f,col="variable", col_wrap=2,sharex=False,sharey=False,height=5)
g = g.map(count_plot,"value")
plt.show()

Generate a data report with pandas_profiling

## Generate a data report
import pandas_profiling

pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")

Full script

# 导入warnings包，利用过滤器来实现忽略警告语句
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

## pd.set_option('display.max_columns', None)# 显示所有列
## pd.set_option('display.max_row', None)# 显示所有行
## 1)载入训练集和测试集
Train_data = pd.read_csv("./datalab/used_car_train_20200313.csv", sep = " ")
Test_data = pd.read_csv("./datalab/used_car_testA_20200313.csv", sep = " ")

## 2)简略观察数据（head()+shape)
#print(Train_data.head().append(Train_data.tail()))
#print(Train_data.shape)
#
# ## 3)通过describe()来熟悉相关统计量
# print(Train_data.describe())
#
# ## 4)通过info()来熟悉数据类型
# print(Train_data.info())
#
# ## 5)判断数据缺失和异常
# print(Train_data.isnull().sum())
#
#nan可视化
# missing = Train_data.isnull().sum()
# missing = missing[missing > 0]
# missing.sort_values(inplace=True) # 排序
# missing.plot.bar() # 绘柱状图
# plt.tight_layout() # 自动调整子图参数
# plt.show()
# # # 可视化看下缺省值
# msno.matrix(Train_data.sample(250))
# # plt.show()
# msno.bar(Train_data.sample(1000)) # 条形图
# plt.show()

## 6) Check for abnormal values
# Train_data.info()
## print(Train_data["notRepairedDamage"].value_counts()) # distinct values and their counts
Train_data["notRepairedDamage"] = Train_data["notRepairedDamage"].replace("-", np.nan) # replace "-" with NaN
# print(Train_data.isnull().sum())

#print(Train_data["notRepairedDamage"].value_counts())
#Test_data.info()
##print(Test_data["notRepairedDamage"].value_counts())
#Test_data["notRepairedDamage"].replace("-", np.nan, inplace=True)
##print(Test_data["notRepairedDamage"].value_counts())

# 删除严重倾斜的数据
#print(Train_data["seller"].value_counts())
#print(Train_data["offerType"].value_counts())
# print(Test_data["seller"].value_counts())
# print(Test_data["offerType"].value_counts())

del Train_data["seller"]
del Train_data["offerType"]
# print(Train_data.info())
# print(Train_data.shape)
#del Test_data["seller"]
#del Test_data["offerType"]




# 了解预测值的分布
# print(Train_data["price"])
# print(Train_data["price"].value_counts())

## 1)总体分布情况(无界约翰逊分布等）
import scipy.stats as st
# y = Train_data["price"]
# plt.figure(1); plt.title("Johnson SU") # 创建新图
# sns.distplot(y, kde=False, fit=st.johnsonsu)
# plt.figure(2); plt.title("Normal")
# sns.distplot(y, kde=False, fit=st.norm)
# plt.figure(3); plt.title("Log Normal")
# sns.distplot(y, kde=False, fit=st.lognorm)
# plt.show() # 最佳拟合是无界约翰逊分布

## 2)查看skewness and kurtosis
# sns.distplot(Train_data["price"])
# print("Skewness: %f" % Train_data["price"].skew()) # 偏度
# print("Kurtosis: %f" % Train_data["price"].kurt()) # 峰度
# plt.show()

# print(Train_data.skew())
# print(Train_data.kurt())
# sns.distplot(Train_data.skew(), color="blue", axlabel="Skewness")
# plt.show()
# sns.distplot(Train_data.kurt(), color="orange", axlabel="Kurtness")
# plt.show()

# 3)查看预测值的具体频数
# plt.hist(Train_data["price"], orientation="vertical", histtype="bar", color="red")
# plt.show() # 直方图
# log变换之后的分布比较均匀，可以进行log变换进行预测，这也是预测问题常用的trick
# plt.hist(np.log(Train_data["price"]), orientation="vertical", histtype="bar", color="red")
# plt.show()



## 查看特征
# 分离label即预测值
Y_train = Train_data["price"]
## 这个区别方式适用于没有直接label coding的数据
## 这里不适用，需要人为根据实际含义来区分
## 数字特征
## numeric_features = Train_data.select_dtypes(include=[np.number])
## numeric_features.columns
## # 类型特征
## categorical_features = Train_data.select_dtypes(include=[np.object])
## categorical_features.columns



# 数字特征
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
# 类别特征
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
## 类别特征nunique分布——Train_data
# for cat_fea in categorical_features:
#     print(cat_fea+"的特征分布如下：")
#     print("{}特征有{}个不同的值".format(cat_fea, Train_data[cat_fea].nunique()))
#     print(Train_data[cat_fea].value_counts())
## 类别特征nunique分布——Test_data
# for cat_fea in categorical_features:
#     print(cat_fea+"的特征分布如下：")
#     print("{}特征有{}个不同的值".format(cat_fea, Test_data[cat_fea].nunique()))
#     print(Test_data[cat_fea].value_counts())

## 数字特征分析
numeric_features.append("price")
# print(numeric_features)
#print(Train_data.head())
## 1)相关性分析
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr() # 返回一个相关系数的矩阵
# print(correlation["price"].sort_values(ascending=False),"\n") # 降序排序

# f , ax = plt.subplots(figsize = (7, 7))
# plt.title("Correlation of Numeric Features with Price")
# sns.heatmap(correlation, square=True, vmax=0.8) # 热图（显示相关系数）
# plt.show()

## 2)查看几个特征的偏度和峰度
# for col in numeric_features:
#     print("{:15}".format(col),"Skewness:{:05.2f}".format(Train_data[col].skew()),
#     "   ",
#           "Kurtosis:{:06.2f}".format(Train_data[col].kurt()))

## 3)每个数字特征得分布可视化
# f = pd.melt(Train_data, value_vars=numeric_features) # 转换
# g = sns.FacetGrid(f,col="variable", col_wrap=2, sharex=False,sharey=False) # 以”variable“作“格子"绘图
# # plt.show()
# g = g.map(sns.distplot, "value") # 以”value“绘制到”格子”图中
# plt.show()

## 4）数字特征相互之间的关系可视化
# sns.set() # 风格设置
# colunms = ["price", "v_12", "v_8", "v_0", "power", "v_5", "v_2", "v_6", "v_1", "v_14"]
# sns.pairplot(Train_data[colunms],size=2, kind="scatter", diag_kind="kde") # 多变量图
# plt.show()

# print(Train_data.columns)
# print(Y_train)


## 5)多变量互相关系回归关系可视化
# fig,((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20)) # 生成5行2列十个子图
# # ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
# v_12_scatter_plot = pd.concat([Y_train,Train_data["v_12"]], axis=1) # 合并成一列
# #print(v_12_scatter_plot)
# sns.regplot(x="v_12", y="price", data=v_12_scatter_plot,scatter=True,fit_reg=True,ax=ax1) # 数据与回归模型拟合
#
# v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
# sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
#
# v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
# sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
#
# power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
# sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
#
# v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
# sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
#
# v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
# sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
#
# v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
# sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)
#
# v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
# sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)
#
# v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
# sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)
#
# v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
# sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
# plt.show()

# 类别特征分析
## 1）nunique分布
# for fea in categorical_features:
#     print(Train_data[fea].nunique())
#
# print(categorical_features)

## 2)类别箱形图可视化
# 因为 name和 regionCode的类别太稀疏了，这里我们把不稀疏的几类画一下
categorical_features = ["model",
                        "brand",
                        "bodyType",
                        "fuelType",
                        "gearbox",
                        "notRepairedDamage"]
for c in categorical_features:
    Train_data[c] = Train_data[c].astype("category") # 强制转换数据类型
    if Train_data[c].isnull().any(): # 检查字段缺失
        Train_data[c] = Train_data[c].cat.add_categories(["MISSING"]) # 添加新类别
        Train_data[c] = Train_data[c].fillna("MISSING") # 填充为NAN的值
# def boxplot(x, y, **kwargs):
#     sns.boxplot(x=x, y=y) # 箱形图
#     x=plt.xticks(rotation=90) #  设置坐标轴
#
# f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
# g = sns.FacetGrid(f,col="variable", col_wrap=2, sharex=False,sharey=False,size=5)
# g = g.map(boxplot, "value", "price")
# plt.show()

## 3)类别特征的小提琴图可视化
#print(Train_data.columns)
# catg_list = categorical_features
# target = "price"
# for catg in catg_list:
#     sns.violinplot(x=catg,y=target,data=Train_data)
#     plt.show()

# print(categorical_features)

## 4)类别特征的柱形图可视化
# def bar_plot(x,y,**kwargs): # 柱形图
#     sns.barplot(x=x,y=y)
#     x=plt.xticks(rotation=90)
# f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
# g = sns.FacetGrid(f, col="variable",col_wrap=2,sharex=False,sharey=False,size=5)
# g = g.map(bar_plot, "value", "price")
# plt.show()

## 5)类别特征的每个类别频数可视化
# def count_plot(x,**kwargs): # 计数直方图
#     sns.countplot(x=x)
#     x=plt.xticks(rotation=90)
# f = pd.melt(Train_data,value_vars=categorical_features)
# g = sns.FacetGrid(f,col="variable", col_wrap=2,sharex=False,sharey=False,size=5)
# g = g.map(count_plot,"value")
# plt.show()

## 生成数据报告
import pandas_profiling
#
# pfr = pandas_profiling.ProfileReport(Train_data)
# pfr.to_file("./example.html")

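Lessons learned

The EDA steps given here are widely used ones. In real work, whether an engineering project or a competition, they are only the first and most basic step.

Next, the actual modeling behavior of the data is usually analyzed together with model performance and feature engineering, drawing on your own understanding and on the literature to reach a judgment and a deeper understanding of the problem.

Finally, EDA, data processing and mining are repeated to arrive at a better data structure and distribution and at strongly correlated features.

In machine learning, data exploration is usually called EDA (Exploratory Data Analysis):

It is a data analysis method that explores existing data (especially raw data from surveys or observation) under as few prior assumptions as possible, using plots, tables, equation fitting and summary statistics to uncover the structure and regularities of the data.

Data exploration helps us discover properties of the data and relationships within it, which is very helpful for later feature construction.

A first look at the data (viewing it directly, or using statistics functions such as .sum(), .mean() and .describe()) can cover: the number of samples, the size of the training set, whether there are time features, whether it is a time-series problem, what the (non-anonymous) features mean, the feature types (string, int, float, time), how values are missing (note how missing values are represented: some are empty, some are marked with a symbol such as "NAN"), and the mean and variance of each feature.
Record how features with more than 30% missing values are handled, which helps later model validation and tuning. Decide whether a feature should be filled (and how: mean, zero, mode and so on), discarded, or whether the samples should first be split and predicted with different feature models.
Analyze outliers separately: check whether samples with abnormal feature values also have abnormal labels (far from the mean, or special symbols), whether outliers should be removed or replaced with normal values, and whether they come from recording errors or from faults in the equipment itself.
Analyze the label separately, for example its distribution.
Further analysis can plot the features alone and jointly with the label (statistical plots, scatter plots) to see the distributions directly. This step can also reveal outliers. Box plots show how far some feature values deviate, and joint plots of feature against feature and of feature against label reveal some of the relationships.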

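Datawhale Zero-Based Introduction to Data Mining - Task 1

Posted 2020-03-21, updated 2020-03-23, in Data Mining and Machine Learning

Background: the zero-based introductory data mining competition opened by Datawhale and Tianchi, Used Car Price Prediction.
Task overview: the task is to predict the transaction price of used cars. The dataset can be viewed and downloaded after registration. The data comes from used-car transaction records of a trading platform, with more than 400,000 records in total and 31 columns of variables, 15 of which are anonymous. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A and 50,000 as test set B, and fields such as name, model, brand and regionCode are desensitized.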

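Task analysis

Data overview

In general, the competition page gives an overview of the data (except for anonymous features) that describes what each column is. Knowing what the columns mean helps with understanding the data and with later analysis. Tip: an anonymous feature is a column whose meaning has not been disclosed.

train.csv

SaleID - sale sample ID

name - car code

regDate - car registration date

model - model code

brand - brand

bodyType - body type

fuelType - fuel type

gearbox - gearbox

power - engine power

kilometer - kilometers driven

notRepairedDamage - the car has damage that has not been repaired

regionCode - code of the region where the car is viewed

seller - seller

offerType - offer type

creatDate - date the ad was posted

price - car price

'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' [anonymous features, 15 in total, v_0 through v_14]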

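Evaluation criterion

The evaluation metric for the task is MAE (Mean Absolute Error):

$$MAE=\frac{1}{N} \sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|$$

The smaller the MAE, the more accurate the model's predictions.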

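Predictive modeling

Predictive modeling is the problem of building a model from historical data and using it to make predictions for new data whose answer is unknown.

More on predictive modeling can be found in this article:

Gentle Introduction to Predictive Modeling: https://machinelearningmastery.com/gentle-introduction-to-predictive-modeling/

Predictive modeling can be described as the mathematical problem of approximating a mapping function from input variables (X) to an output variable (y). This is called the function approximation problem.

The job of the modeling algorithm is to find the best mapping function it can within the available time and resources. For more on function approximation in applied machine learning, see this article:

How Machine Learning Algorithms Work: https://machinelearningmastery.com/how-machine-learning-algorithms-work/

In general, function approximation tasks can be divided into classification tasks and regression tasks.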

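Classification predictive modeling

Classification predictive modeling approximates a mapping function (f) from input variables (X) to a discrete output variable (y).

The output variable is often called a label or a category. The mapping function predicts a class label for a given observation.

For example, an email can be classified into one of two classes: "spam" and "not spam".

A classification problem requires samples to be assigned to one of two or more classes.
The inputs of a classification problem can be real-valued or discrete.
A problem with only two classes is often called a two-class or binary classification problem.
A problem with more than two classes is often called a multi-class classification problem.
A problem in which a sample can belong to several classes is called a multi-label classification problem.

Classification models often predict a continuous, probability-like value for each class. These probabilities can be interpreted as the likelihood or confidence that the sample belongs to each class. A predicted probability can be converted into a class label by choosing the class with the highest probability.

For example, an email may be assigned a probability of 0.1 of being "spam" and 0.9 of being "not spam". Because "not spam" has the highest probability, the probabilities are converted into the label "not spam".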
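Converting predicted probabilities into a label, as in the email example, amounts to taking the class with the largest probability. A minimal sketch, where the function name `to_label` and the dictionary layout are assumptions for illustration:

```python
def to_label(probabilities):
    """Return the class whose predicted probability is highest."""
    return max(probabilities, key=probabilities.get)


# Probabilities from the email example in the text
email = {"spam": 0.1, "not spam": 0.9}
label = to_label(email)
print(label)
```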

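There are many metrics for measuring the performance of a classification model, but classification accuracy is probably the most common.

For example, if a classification model makes 5 predictions, of which 3 are correct and 2 are wrong, its accuracy is 60%: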

accuracy = correct predictions / total predictions * 100
accuracy = 3 / 5 * 100
accuracy = 60%
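The same arithmetic can be written as a small function. The label lists below are made up so that 3 of the 5 predictions are correct, and the function name is an assumption.

```python
def accuracy_percent(y_true, y_pred):
    """Share of predictions that match the true labels, in percent."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct * 100 / len(y_true)


# Hypothetical labels: positions 0, 1 and 2 are predicted correctly
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1]
acc = accuracy_percent(y_true, y_pred)
print(acc)
```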

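An algorithm that can learn a classification model is called a classification algorithm.

Regression predictive modeling

Regression predictive modeling approximates a mapping function from input variables (X) to a continuous output variable (y).

A continuous output variable is a real value, such as an integer or a floating-point number. These are usually quantities such as amounts or sizes.

For example, a house may be predicted to sell for xx dollars, perhaps in the range of $100,000 to $200,000.

A regression problem requires predicting a quantity.
The inputs of a regression problem can be continuous or discrete.
A problem with several input variables is often called a multivariate regression problem.
A regression problem whose input variables are ordered by time is called a time-series forecasting problem.
Because a regression model predicts a quantity, its performance can be evaluated by the error in its predictions.

There are many ways to evaluate a regression model, but probably the most common is the root mean squared error, RMSE.

For example, if a regression model makes two predictions, one of 1.5 where the expected value is 1.0 and another of 3.3 where the expected value is 3.0, the RMSE is: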

RMSE = sqrt(average(error^2))
RMSE = sqrt(((1.0 - 1.5)^2 + (3.0 - 3.3)^2) / 2)
RMSE = sqrt((0.25 + 0.09) / 2)
RMSE = sqrt(0.17)
RMSE = 0.412
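The worked example can be checked with a few lines of code, using the two predictions and expected values given in the text. The function name is an assumption.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between expected and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Expected values 1.0 and 3.0, predictions 1.5 and 3.3 (from the text)
value = rmse([1.0, 3.0], [1.5, 3.3])
print(round(value, 3))
```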

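A benefit of RMSE is that the error score is in the same units as the predicted value.

An algorithm that can learn a regression model is called a regression algorithm.

Some algorithms have the word "regression" in their name, such as linear regression and logistic regression. This can be confusing, because linear regression is indeed a regression algorithm, whereas logistic regression is a classification algorithm.

Classification vs regression

Classification and regression predictive modeling problems are different.

Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity.

Classification and regression also overlap in some ways:

A classification algorithm may predict a continuous value, but that value takes the form of a probability for a class label.
A regression algorithm may predict a discrete value, but in the form of an integer quantity.

Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks. Others cannot, or cannot easily, be used for both kinds of problem: linear regression is for regression predictive modeling and logistic regression is for classification predictive modeling.

Importantly, classification and regression predictions are evaluated differently, for example:

Classification predictions can be evaluated with accuracy, whereas regression predictions cannot.
Regression predictions can be evaluated with root mean squared error, whereas classification predictions cannot.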

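Converting between classification and regression problems

In some cases a regression problem can be converted into a classification problem. For example, the quantity to be predicted can be converted into discrete buckets.

For example, amounts in a continuous range between $0 and $100 can be divided into two buckets:

class 0: $0 to $49
class 1: $50 to $100

This is often called discretization. The resulting output variable is a classification whose labels have an order (called ordinal).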
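The two buckets above can be expressed as a small function. The name `to_class`, the integer amounts and the boundary argument are assumptions made for this sketch.

```python
def to_class(amount, boundary=50):
    """Map an amount in [0, 100] to class 0 (below boundary) or class 1."""
    if not 0 <= amount <= 100:
        raise ValueError("amount must be between 0 and 100")
    return 0 if amount < boundary else 1


amounts = [0, 49, 50, 100]
classes = [to_class(a) for a in amounts]
print(classes)
```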

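In some cases a classification problem can be converted into a regression problem. For example, a label can be converted into a continuous range.

Some algorithms already do this by predicting a probability for each class, which in turn can be scaled to a specific range: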

quantity = min + probability * range
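As a concrete instance of the formula, here `range` is taken to mean `maximum - minimum`, and the function name and the $0 to $100 interval are assumptions for illustration.

```python
def probability_to_quantity(probability, minimum, maximum):
    """Scale a probability in [0, 1] onto the interval [minimum, maximum]."""
    return minimum + probability * (maximum - minimum)


# A probability of 0.5 mapped onto the $0 to $100 range
quantity = probability_to_quantity(0.5, 0, 100)
print(quantity)
```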

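Alternatively, class values can be ordered and mapped to a continuous range:

$0 to $49 is class 1
$50 to $100 is class 2

If the class labels in a classification problem have no natural ordinal relationship, converting the problem to regression may give surprising results or poor performance, because the model may learn a mapping from inputs to a continuous output that does not really exist.

Original article: https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/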

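About Evaluation Metrics

An evaluation metric is a numerical measure of how well a model performs. (It is somewhat like scoring a product, except that here the score reflects the gap between the model's results and the ideal results.)

In general, the evaluation metrics for classification and regression problems take the following forms.

Common evaluation metrics for classification algorithms:

For binary classifiers, the main metrics are accuracy, Precision, Recall, F-score, the PR curve, and the ROC curve with its AUC.

For multi-class classifiers, the main metrics are accuracy, macro-average and micro-average, and F-score.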
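The difference between macro-averaging and micro-averaging can be illustrated with precision. The sketch below is a hand-rolled illustration with hypothetical labels and function names, not a library API: macro-averaging computes precision per class and then takes the unweighted mean, whereas micro-averaging pools the counts over all classes first.

```python
def per_class_counts(y_true, y_pred, label):
    """Count true positives and false positives for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    return tp, fp


def macro_micro_precision(y_true, y_pred):
    """Return (macro-averaged precision, micro-averaged precision)."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, label) for label in labels]
    # Macro: precision per class (0.0 if the class is never predicted), then the mean.
    per_class = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp in counts]
    macro = sum(per_class) / len(per_class)
    # Micro: pool the counts over all classes, then compute a single precision.
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    micro = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    return macro, micro


y_true = [0, 0, 0, 0, 1, 2]  # hypothetical true labels
y_pred = [0, 0, 0, 1, 1, 0]  # hypothetical predictions
macro, micro = macro_micro_precision(y_true, y_pred)
print("macro:", round(macro, 3), "micro:", round(micro, 3))
```

With imbalanced classes the two averages can differ noticeably, because macro-averaging gives every class the same weight while micro-averaging gives every sample the same weight.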

Common evaluation metrics for regression:

Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), and R2 (R-Square).

Mean Absolute Error

Mean Absolute Error (MAE) directly reflects the actual size of the error between predicted and true values. It is computed as:
$$MAE=\frac{1}{N} \sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|$$

Mean Squared Error

Mean Squared Error (MSE) is computed as:
$$MSE=\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
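The MAE and MSE formulas above translate directly into Python. This is a minimal standard-library sketch; the sample values are hypothetical and chosen so the results are easy to check by hand.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: the average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """MSE: the average of (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, -0.5, 2, 7]   # hypothetical true values
y_pred = [2.5, 0.0, 2, 8]  # hypothetical predictions
print("MAE:", mean_absolute_error(y_true, y_pred))  # MAE: 0.5
print("MSE:", mean_squared_error(y_true, y_pred))   # MSE: 0.375
```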

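R2 (R-Square)

Residual sum of squares:
$$SS_{res}=\sum\left(y_{i}-\hat{y}_{i}\right)^{2}$$
Total sum of squares:
$$SS_{tot}=\sum\left(y_{i}-\overline{y}\right)^{2}$$
where $\overline{y}$ denotes the mean of $y$. The expression for $R^2$ is then:

$$R^{2}=1-\frac{SS_{res}}{SS_{tot}}$$

$R^2$ measures the proportion of the variation in the dependent variable that is explained by the independent variables. For a least-squares fit with an intercept, evaluated on the data it was fitted to, it lies between 0 and 1; on other data it can be negative. The closer $R^2$ is to 1, the larger the share of the total sum of squares accounted for by the regression sum of squares, the closer the regression line is to the observed points, the more of the variation in y is explained by the variation in x, and the better the fit. For this reason $R^2$ is also called the goodness-of-fit statistic.

$y_{i}$ denotes the true value,

$\hat{y}_{i}$ denotes the predicted value,

$\overline{y}$ denotes the sample mean. The higher the score, the better the fit.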
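The $R^2$ definition above can be checked by hand with a short standard-library sketch. The function name and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the definitions above."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, -0.5, 2, 7]   # hypothetical true values
y_pred = [2.5, 0.0, 2, 8]  # hypothetical predictions
print(round(r_squared(y_true, y_pred), 3))  # 0.949
```

Here SS_res is 1.5 and SS_tot is 29.1875, so the score is 1 - 1.5 / 29.1875.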

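Geometric Interpretation

In the figure above, the red points form the scatter plot of the independent variable incoming against the dependent variable Consuming, and the blue line is the regression line (obtained by least squares).
Each red point $y_{i}$ is an observed response value (4 in total), and each blue point $f_{i}$ is the corresponding point on the regression line. The difference between the two is the residual, so there are 4 residuals. $\overline{y}$ is the mean of the response variable.

According to the sum-of-squares decomposition:

$$\sum\left(y_{i}-\overline{y}\right)^{2}=\sum\left(f_{i}-\overline{y}\right)^{2}+\sum\left(y_{i}-f_{i}\right)^{2}$$

That is: SS total = SS regression + SS residual (the sum of squared differences between the observations and their mean is accounted for by the regression sum of squares plus the residual sum of squares).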
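The decomposition can be verified numerically for a least-squares line. The sketch below fits a line to four hypothetical observations (they are not the data from the figure) using the closed-form simple linear regression formulas, then compares the three sums of squares.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


x = [1.0, 2.0, 3.0, 4.0]  # hypothetical independent variable
y = [2.0, 3.5, 3.0, 5.5]  # hypothetical observed responses

slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]  # points f_i on the regression line
mean_y = sum(y) / len(y)

ss_tot = sum((yi - mean_y) ** 2 for yi in y)
ss_reg = sum((fi - mean_y) ** 2 for fi in fitted)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

print(ss_tot, ss_reg, ss_res)
print(abs(ss_tot - (ss_reg + ss_res)) < 1e-9)  # True
```

The identity relies on the line being the least-squares fit with an intercept; for an arbitrary line the cross term does not vanish and the two sides generally differ.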

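Analysis Results

This task is a traditional data mining problem: the result is obtained by building models with data science, machine learning, and deep learning methods.
It is a typical regression problem.
The main tools are xgb, lgb, and catboost, together with common data mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn, and keras.
EDA is used to uncover relationships in the data and to become familiar with it.

Code Example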

import pandas as pd

# 1) Load the training and test sets (space-separated files)
Train_data = pd.read_csv('./datalab/used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv('./datalab/used_car_testA_20200313.csv', sep=' ')

print(Train_data.shape)  # (number of rows, number of columns)
print(Test_data.shape)
out_put = Train_data.head(4)  # first four rows
print(out_put)

# 2) Classification metrics

## Accuracy
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]  # predicted labels
y_true = [0, 1, 1, 1]  # true labels
print('ACC:', accuracy_score(y_true, y_pred))  # fraction of correctly classified samples (float)


"""
TP: true positive, a positive sample predicted as positive
TN: true negative, a negative sample predicted as negative
FP: false positive, a negative sample predicted as positive
FN: false negative, a positive sample predicted as negative
"""
## Precision, Recall, F1-score
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision:', metrics.precision_score(y_true, y_pred))  # TP / (TP + FP)
print('Recall:', metrics.recall_score(y_true, y_pred))  # TP / (TP + FN)
print('F1-score:', metrics.f1_score(y_true, y_pred))  # 2 * (P * R) / (P + R)

## AUC
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])  # true binary labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # predicted scores for the positive class
print('AUC score:', roc_auc_score(y_true, y_scores))

# 3) Regression metrics (reuses np and metrics imported above)

# MAPE is implemented by hand here; it is undefined when y_true contains zeros
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE
print('MSE:', metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:', metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:', mape(y_true, y_pred))

## R2-score
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:', r2_score(y_true, y_pred))
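The TP, FP, and FN definitions in the docstring above can be applied by hand to the precision and recall example. The following standard-library sketch counts them directly; the helper name `precision_recall_f1` is hypothetical, and returning 0.0 for an empty denominator is a choice made for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 from raw TP / FP / FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, round(f1, 4))  # 1.0 0.5 0.6667
```

In this example there is one true positive, no false positive, and one false negative, which gives the values shown.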

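How to Create a Blog with GitHub

Posted on 2020-03-19, updated on 2020-03-21, category: Notes
Word count: 990, reading time ≈ 1 minute

Building a blog with GitHub requires some familiarity with git to manage the site. If you are interested in git, refer to a git tutorial.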

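Setting Up the Environment

Install node

Because hexo is built on Node.js, download and install node first. Check the installed version with node -v; if it is not installed, follow the prompts.

Install npm

npm is needed alongside Node.js. Downloads on Ubuntu can be very slow, so switching to a mirror inside China is recommended; see the post on changing the Ubuntu apt-get and pip sources.

Initialize the blog

Install hexo by entering in the terminal: npm install hexo-cli -g (see the Hexo documentation)
Initialize the blog directory: hexo init happybear1234.github.io (replace happybear1234 with your own name; here it is my GitHub username)
After initialization, enter the blog directory: cd happybear1234.github.io (all later operations on the blog are performed here)
Install the dependencies: npm install
Clean up: hexo clean
Generate the static pages: hexo g
Start the local server: hexo s

Open a browser and enter the address shown in the terminal, localhost:4000, to see the blog (if the port is reported as in use, switch to another one with hexo server -p 5000)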

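Choose a Hexo Theme

Zhihu users have put together a collection of recommended hexo themes. When starting out, the NexT theme is a good choice for getting familiar with the various settings, because its documentation is detailed and its interface is clean. For installing and configuring NexT, see its documentation.

Deploying Online

At this point the blog can only be accessed locally. It can be deployed for free with GitHub Pages.

Create a repository

Create a public repository named xxx.github.io, where xxx is your name. I used happybear1234.github.io, so the blog can then be reached at happybear1234.github.io.

Install hexo-deployer-git

Run the following command in the blog directory so that local posts can be pushed to GitHub:
npm install hexo-deployer-git --save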

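Configure Git

Open the configuration file in the blog directory with vi _config.yml and enter your git address:
deploy:
  type: git
  repo: https://github.com/xxx/xxxx.github.io.git

Push the Site to GitHub

In the blog directory, enter: hexo d
Once the push completes, you can visit xxx.github.io to view the site.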

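Loading and preprocessing the data

The sklearn library

Ten-fold cross-validation

Leave-one-out

Formula explanation

Implementation with the sklearn library

Loading and preprocessing the data

Fitting with sklearn logistic regression

Plotting the decision boundary

Gradient descent implementation

Implementing the formulas above

Plotting the convergence curve

Plotting the decision boundary

Testing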

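Underlying principles

Linear regression model

Decision tree model

GBDT model

XGBoost model

LightGBM model

Recommended textbooks

Reading the data

Linear regression & five-fold cross-validation & simulating a real business scenario

Simple modeling

Five-fold cross-validation

Simulating a real business scenario

Plotting learning curves and validation curves

Comparing multiple models

Linear models & embedded feature selection

Nonlinear models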

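Common feature engineering steps include

Importing the data

Removing outliers

Feature construction

Data binning

Exporting the data

Feature construction

Normalization

Exporting the data

Feature selection

Filter methods

Wrapper methods

Embedded methods

Code snippets

Lessons learned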

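Loading data science and visualization libraries

Loading the data

A quick look at the data

Overview of the data

Using describe() to get familiar with the summary statistics

Using info() to get familiar with the data types

Checking for missing and abnormal data

Checking each column for NaN values

Visualizing NaN values

Visualizing missing values

Outlier detection

Understanding the distribution of the target values

Overall distribution (unbounded Johnson distribution, etc.)

Checking skewness and kurtosis

Checking the frequencies of the target values

Splitting features into categorical and numeric features, and checking the nunique distribution of the categorical features

Data types

Numeric feature analysis

Correlation analysis

Checking the skewness and kurtosis of several features

Visualizing the distribution of each numeric feature

Visualizing the relationships between numeric features

Visualizing pairwise regression relationships between variables

Categorical feature analysis

nunique distribution

Box plots of categorical features

Violin plots of categorical features

Bar charts of categorical features

Visualizing the frequency of each category in the categorical features

Generating a data report with pandas_profiling

Code snippets

Lessons learned

Finally, keep iterating between EDA and data processing and mining to arrive at a better data structure and distribution and at more strongly correlated features.

Competition task analysis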