
Datawhale Beginner Data Mining - Task 2: Exploratory Data Analysis (EDA)

• The value of EDA lies mainly in getting familiar with the dataset, understanding it, and verifying that it can actually be used for the machine learning or deep learning work that follows.
• Once we understand the dataset, the next step is to explore the relationships among the variables and between the variables and the target.
• Finally, we carry out data processing and feature engineering so that the structure and feature set of the data make the downstream prediction problem more reliable.

Load the data science and visualization libraries

# Import the warnings package and use a filter to suppress warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

Load the data

## pd.set_option('display.max_columns', None)  # show all columns
## pd.set_option('display.max_rows', None)     # show all rows
## 1) Load the training and test sets
Train_data = pd.read_csv("./datalab/used_car_train_20200313.csv", sep=" ")
Test_data = pd.read_csv("./datalab/used_car_testA_20200313.csv", sep=" ")
• The following mainly uses Train_data as the example.

Take a quick look at the data

## 2) Take a quick look at the data (head() + shape)
print(pd.concat([Train_data.head(), Train_data.tail()]))  # DataFrame.append was removed in pandas 2.x, so pd.concat is used here
        SaleID    name   regDate  model  ...      v_11      v_12      v_13      v_14
0            0     736  20040402   30.0  ...  2.804097 -2.420821  0.795292  0.914762
1            1    2262  20030301   40.0  ...  2.096338 -1.030483 -1.722674  0.245522
2            2   14874  20040403  115.0  ...  1.803559  1.565330 -0.832687 -0.229963
3            3   71865  19960908  109.0  ...  1.285940 -0.501868 -2.438353 -0.478699
4            4  111080  20120103  110.0  ...  0.910783  0.931110  2.834518  1.923482
149995  149995  163978  20000607  121.0  ... -2.983973  0.589167 -1.304370 -0.302592
149996  149996  184535  20091102  116.0  ... -2.774615  2.553994  0.924196 -0.272160
149997  149997  147587  20101003   60.0  ... -1.630677  2.290197  1.891922  0.414931
149998  149998   45907  20060312   34.0  ... -2.633719  1.414937  0.431981 -1.659014
149999  149999  177672  19990204   19.0  ... -3.179913  0.031724 -1.483350 -0.342674

[10 rows x 31 columns]

print(Train_data.shape)
(150000, 31)

Overview of the data

  1. describe() gives the summary statistics of each column: count, mean, standard deviation (std), min, the 25%/50%/75% quantiles, and max. The point of this view is to grasp the rough range of the data at a glance and to judge which values are anomalous: entries such as 999, 9999 or -1 are often just another way of encoding NaN, which is worth watching for (see the sketch after this list).
  2. info() shows the dtype of each column, which helps reveal special placeholder symbols beyond plain NaN.
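Where such sentinel codes do show up, a common first step is to map them to NaN before computing any statistics. A minimal sketch, assuming a hypothetical sentinel list (this dataset may use different codes, so verify against describe() first):

import numpy as np

SENTINELS = [999, 9999, -1]                      # hypothetical stand-ins for missing data
Train_data_clean = Train_data.replace(SENTINELS, np.nan)
print(Train_data_clean.isnull().sum())           # recount the missing values afterwards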

Get familiar with the summary statistics via describe()

## 3) Get familiar with the summary statistics via describe()
print(Train_data.describe())
              SaleID           name  ...           v_13           v_14
count  150000.000000  150000.000000  ...  150000.000000  150000.000000
mean    74999.500000   68349.172873  ...       0.000313      -0.000688
std     43301.414527   61103.875095  ...       1.288988       1.038685
min         0.000000       0.000000  ...      -4.153899      -6.546556
25%     37499.750000   11156.000000  ...      -1.057789      -0.437034
50%     74999.500000   51638.000000  ...      -0.036245       0.141246
75%    112499.250000  118841.250000  ...       0.942813       0.680378
max    149999.000000  196812.000000  ...      11.147669       8.658418

[8 rows x 30 columns]

Get familiar with the data types via info()

## 4) Get familiar with the data types via info()
print(Train_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
None

Check for missing values and anomalies

Check the NaN count of each column

## 5) Check the NaN count of each column
print(Train_data.isnull().sum())
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

NaN visualization

# Visualize the NaN counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)  # sort ascending
missing.plot.bar()  # bar chart
plt.tight_layout()  # auto-adjust subplot parameters
plt.show()

[Figure 1: bar chart of per-column NaN counts]

  • The plot above makes it easy to see which columns contain NaN, and printing the counts tells you how severe the problem is. If a column has only a few NaNs, filling them is usually fine; with tree models such as LightGBM you can leave the gaps and let the trees handle them; if a column is mostly NaN, consider dropping it (see the sketch below).
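A minimal sketch of that triage, assuming an illustrative 50% cutoff (the threshold is a judgment call, not something fixed by the tutorial):

# Fraction of NaN per column; drop columns that are mostly missing
nan_ratio = Train_data.isnull().mean()
cols_to_drop = nan_ratio[nan_ratio > 0.5].index  # hypothetical 50% cutoff
print("columns to drop:", list(cols_to_drop))
# For the remaining sparse gaps a simple fill could be, e.g.:
# Train_data["bodyType"] = Train_data["bodyType"].fillna(Train_data["bodyType"].mode()[0])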

Visualize the missing values with missingno

## Visualize the missing values
msno.matrix(Train_data.sample(250))
plt.show()

[Figure 2: missingno matrix of a 250-row sample]

msno.bar(Train_data.sample(1000))  # bar chart of non-null counts
plt.show()

[Figure 3: missingno bar chart of a 1000-row sample]

Inspect outliers and anomalous values

  • From the earlier info() output we know that every column is numeric except notRepairedDamage, which is of type object. Displaying its distinct values makes the reason clear:

print(Train_data["notRepairedDamage"].value_counts())  # distinct values and their counts

    0.0    111361
    -       24324
    1.0     14315
    Name: notRepairedDamage, dtype: int64

  • Clearly '-' is also a missing-value marker. Since many models handle NaN natively, we do not impute here; we simply replace '-' with NaN.

Train_data["notRepairedDamage"].replace("-", np.nan, inplace=True)  # replace '-' with NaN
print(Train_data["notRepairedDamage"].value_counts())
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
  • Check the NaN counts again:
print(Train_data.isnull().sum())
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64
  • The following two categorical features are severely skewed; features like this generally contribute nothing to prediction, so we drop them here. You could keep digging into them, but it is usually not worth much.
print(Train_data["seller"].value_counts())
0    149999
1         1
Name: seller, dtype: int64
print(Train_data["offerType"].value_counts())
0    150000
Name: offerType, dtype: int64
# Drop the severely skewed columns
del Train_data["seller"]
del Train_data["offerType"]
print(Train_data.info())
print(Train_data.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 29 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  creatDate          150000 non-null  int64  
 13  price              150000 non-null  int64  
 14  v_0                150000 non-null  float64
 15  v_1                150000 non-null  float64
 16  v_2                150000 non-null  float64
 17  v_3                150000 non-null  float64
 18  v_4                150000 non-null  float64
 19  v_5                150000 non-null  float64
 20  v_6                150000 non-null  float64
 21  v_7                150000 non-null  float64
 22  v_8                150000 non-null  float64
 23  v_9                150000 non-null  float64
 24  v_10               150000 non-null  float64
 25  v_11               150000 non-null  float64
 26  v_12               150000 non-null  float64
 27  v_13               150000 non-null  float64
 28  v_14               150000 non-null  float64
dtypes: float64(20), int64(8), object(1)
memory usage: 33.2+ MB
None

(150000, 29)

Understand the distribution of the target

print(Train_data["price"])
print(Train_data["price"].value_counts())
0         1850
1         3600
2         6222
3         2400
4         5200
      ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64

500      2337
1500     2158
1200     1922
1000     1850
2500     1821
     ... 
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64

Overall distribution (unbounded Johnson distribution, etc.)

## 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data["price"]
plt.figure(1); plt.title("Johnson SU")  # new figure
sns.distplot(y, kde=False, fit=st.johnsonsu)  # distplot is deprecated in newer seaborn; histplot/displot replace it
plt.figure(2); plt.title("Normal")
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title("Log Normal")
sns.distplot(y, kde=False, fit=st.lognorm)
plt.show()  # the best fit is the unbounded Johnson distribution

[Figures 4-6: price fitted with Johnson SU, normal, and log-normal distributions]

  • price does not follow a normal distribution, so it has to be transformed before regression. The log transform does reasonably well, but the best fit is the unbounded Johnson distribution, as sketched below.
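To back up the visual comparison with a number, one option (not part of the original write-up) is to fit each candidate distribution and compare a goodness-of-fit statistic such as Kolmogorov-Smirnov, where lower means a better fit:

import scipy.stats as st

y = Train_data["price"]
for dist in (st.johnsonsu, st.norm, st.lognorm):
    params = dist.fit(y)                       # estimate shape/loc/scale
    ks = st.kstest(y, dist.name, args=params)  # KS test against the fitted distribution
    print("{:10s} KS statistic: {:.4f}".format(dist.name, ks.statistic))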

Check skewness and kurtosis

## 2) Check skewness and kurtosis
sns.distplot(Train_data["price"])
print("Skewness: %f" % Train_data["price"].skew())  # skewness
print("Kurtosis: %f" % Train_data["price"].kurt())  # kurtosis
plt.show()
Skewness: 3.346487
Kurtosis: 18.995183

[Figure 7: distribution of price]

print(Train_data.skew())
print(Train_data.kurt())
SaleID               6.017846e-17
name                 5.576058e-01
regDate              2.849508e-02
model                1.484388e+00
brand                1.150760e+00
bodyType             9.915299e-01
fuelType             1.595486e+00
gearbox              1.317514e+00
power                6.586318e+01
kilometer           -1.525921e+00
notRepairedDamage    2.430640e+00
regionCode           6.888812e-01
creatDate           -7.901331e+01
price                3.346487e+00
v_0                 -1.316712e+00
v_1                  3.594543e-01
v_2                  4.842556e+00
v_3                  1.062920e-01
v_4                  3.679890e-01
v_5                 -4.737094e+00
v_6                  3.680730e-01
v_7                  5.130233e+00
v_8                  2.046133e-01
v_9                  4.195007e-01
v_10                 2.522046e-02
v_11                 3.029146e+00
v_12                 3.653576e-01
v_13                 2.679152e-01
v_14                -1.186355e+00
dtype: float64

SaleID                 -1.200000
name                   -1.039945
regDate                -0.697308
model                   1.740483
brand                   1.076201
bodyType                0.206937
fuelType                5.880049
gearbox                -0.264161
power                5733.451054
kilometer               1.141934
notRepairedDamage       3.908072
regionCode             -0.340832
creatDate            6881.080328
price                  18.995183
v_0                     3.993841
v_1                    -1.753017
v_2                    23.860591
v_3                    -0.418006
v_4                    -0.197295
v_5                    22.934081
v_6                    -1.742567
v_7                    25.845489
v_8                    -0.636225
v_9                    -0.321491
v_10                   -0.577935
v_11                   12.568731
v_12                    0.268937
v_13                   -0.438274
v_14                    2.393526
dtype: float64
sns.distplot(Train_data.skew(), color="blue", axlabel="Skewness")
plt.show()

[Figure 8: distribution of per-column skewness]

sns.distplot(Train_data.kurt(), color="orange", axlabel="Kurtosis")
plt.show()

[Figure 9: distribution of per-column kurtosis]

Check the frequency counts of the target

# 3) Check the frequency counts of the target
plt.hist(Train_data["price"], orientation="vertical", histtype="bar", color="red")
plt.show()  # histogram

[Figure 10: histogram of price]

  • Looking at the frequencies, values greater than 20000 are very rare. They could be treated as special (outlier) values and either filled or dropped in the earlier preprocessing step; a sketch follows.
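A minimal sketch of that treatment, assuming we either drop or cap at 20000 (which of the two is better is a modeling choice the tutorial leaves open):

# Option A: drop the rare high-price rows
# Train_data = Train_data[Train_data["price"] <= 20000]
# Option B: cap (winsorize) them at the threshold instead
Train_data["price_capped"] = Train_data["price"].clip(upper=20000)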
# After the log transform the distribution is fairly even, so predicting on the log scale is a common trick for regression problems
plt.hist(np.log(Train_data["price"]), orientation="vertical", histtype="bar", color="red")
plt.show()

[Figure 11: histogram of log(price)]

  • The distribution after the log transform is much more even, so training on log(price) is a common trick for this kind of prediction problem; a round-trip sketch follows.
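A minimal round-trip sketch, training on the log scale and inverting predictions afterwards (the model itself is a placeholder here):

y_log = np.log1p(Train_data["price"])            # log1p also handles a price of 0 safely
print("skew after log1p: %.3f" % y_log.skew())   # much closer to 0 than the raw 3.35
# ... fit a model on y_log instead of price ...
# price_pred = np.expm1(y_log_pred)              # invert back to the original price scale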

Split the features into categorical and numeric, and check the nunique distribution of the categorical features

Data fields

  • name - car code
  • regDate - vehicle registration date
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox
  • power - engine power
  • kilometer - kilometers driven
  • notRepairedDamage - whether the car has unrepaired damage
  • regionCode - region code of the viewing location
  • seller - seller type [already dropped]
  • offerType - offer type [already dropped]
  • creatDate - date the ad was posted
  • price - vehicle price
  • 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' [15 anonymous features, v_0 through v_14]
# Separate out the label (the target to predict)
Y_train = Train_data['price']
# This automatic split works for data without pre-encoded labels
# It does not apply here; the split must be made by hand from the actual meaning of each field
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=[np.object])
# categorical_features.columns
# Numeric features
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
# Categorical features (these two lists must be defined for the loop below to run)
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
## nunique distribution of the categorical features - Train_data
for cat_fea in categorical_features:
    print(cat_fea + " distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
name distribution:
name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
     ... 
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64

model distribution:
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
     ...  
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64

brand distribution:
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64

bodyType distribution:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64

fuelType distribution:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64

gearbox distribution:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64

notRepairedDamage distribution:
notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64

regionCode distribution:
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
       ... 
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64

Numeric feature analysis

numeric_features.append("price")
print(numeric_features)
['power', 
'kilometer', 
'v_0', 
'v_1', 
'v_2', 
'v_3', 
'v_4', 
'v_5', 
'v_6', 
'v_7', 
'v_8', 
'v_9', 
'v_10', 
'v_11', 
'v_12', 
'v_13', 
'v_14', 
'price']
print(Train_data.head())
   SaleID    name   regDate  model  ...      v_11      v_12      v_13      v_14
0       0     736  20040402   30.0  ...  2.804097 -2.420821  0.795292  0.914762
1       1    2262  20030301   40.0  ...  2.096338 -1.030483 -1.722674  0.245522
2       2   14874  20040403  115.0  ...  1.803559  1.565330 -0.832687 -0.229963
3       3   71865  19960908  109.0  ...  1.285940 -0.501868 -2.438353 -0.478699
4       4  111080  20120103  110.0  ...  0.910783  0.931110  2.834518  1.923482

[5 rows x 29 columns]

Correlation analysis

## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()  # correlation-coefficient matrix
print(correlation["price"].sort_values(ascending=False), "\n")  # sort descending
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64 
f, ax = plt.subplots(figsize=(7, 7))
plt.title("Correlation of Numeric Features with Price")
sns.heatmap(correlation, square=True, vmax=0.8)  # heatmap of the correlation matrix
plt.show()

[Figure 12: correlation heatmap of the numeric features]
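An optional follow-up (not in the original tutorial) is to zoom the heatmap in on the features most correlated with price, for example the top 10 by absolute correlation:

k = 10  # hypothetical number of features to keep
top_cols = correlation["price"].abs().sort_values(ascending=False).index[:k]
sns.heatmap(Train_data[top_cols].corr(), annot=True, fmt=".2f", square=True)
plt.show()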

Check the skewness and kurtosis of several features

## 2) Check the skewness and kurtosis of several features
for col in numeric_features:
    print("{:15}".format(col), "Skewness:{:05.2f}".format(Train_data[col].skew()),
          " ",
          "Kurtosis:{:06.2f}".format(Train_data[col].kurt()))
power           Skewness:65.86     Kurtosis:5733.45
kilometer       Skewness:-1.53     Kurtosis:001.14
v_0             Skewness:-1.32     Kurtosis:003.99
v_1             Skewness:00.36     Kurtosis:-01.75
v_2             Skewness:04.84     Kurtosis:023.86
v_3             Skewness:00.11     Kurtosis:-00.42
v_4             Skewness:00.37     Kurtosis:-00.20
v_5             Skewness:-4.74     Kurtosis:022.93
v_6             Skewness:00.37     Kurtosis:-01.74
v_7             Skewness:05.13     Kurtosis:025.85
v_8             Skewness:00.20     Kurtosis:-00.64
v_9             Skewness:00.42     Kurtosis:-00.32
v_10            Skewness:00.03     Kurtosis:-00.58
v_11            Skewness:03.03     Kurtosis:012.57
v_12            Skewness:00.37     Kurtosis:000.27
v_13            Skewness:00.27     Kurtosis:-00.44
v_14            Skewness:-1.19     Kurtosis:002.39
price           Skewness:03.35     Kurtosis:019.00

Visualize the distribution of each numeric feature

## 3) Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)  # unpivot to long format
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)  # one facet per variable
g = g.map(sns.distplot, "value")  # draw the "value" distribution in each facet
plt.show()

[Figure 13: facet grid of numeric feature distributions]

  • The anonymous features are fairly evenly distributed.

Visualize pairwise relationships between the numeric features

## 4) Visualize pairwise relationships between the numeric features
sns.set()  # style setup
columns = ["price", "v_12", "v_8", "v_0", "power", "v_5", "v_2", "v_6", "v_1", "v_14"]
sns.pairplot(Train_data[columns], height=2, kind="scatter", diag_kind="kde")  # pair plot; `size` was renamed `height` in newer seaborn
plt.show()

[Figure 14: pair plot of the selected features]

Visualize regression relationships between multiple variables

  • Here we visualize the relationships among several variables at once; for more on visualization, this article is a good reference: https://www.jianshu.com/p/6e18d21a4cad

    print(Train_data.columns)

    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
           'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
           'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
          dtype='object')
print(Y_train)
0         1850
1         3600
2         6222
3         2400
4         5200
      ... 
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
## 5) Visualize regression relationships between multiple variables
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))  # a 5x2 grid of ten subplots
# ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train, Train_data["v_12"]], axis=1)  # combine target and feature side by side
# print(v_12_scatter_plot)
sns.regplot(x="v_12", y="price", data=v_12_scatter_plot, scatter=True, fit_reg=True, ax=ax1)  # scatter plus fitted regression line

v_8_scatter_plot = pd.concat([Y_train, Train_data['v_8']], axis=1)
sns.regplot(x='v_8', y='price', data=v_8_scatter_plot, scatter=True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train, Train_data['v_0']], axis=1)
sns.regplot(x='v_0', y='price', data=v_0_scatter_plot, scatter=True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train, Train_data['power']], axis=1)
sns.regplot(x='power', y='price', data=power_scatter_plot, scatter=True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train, Train_data['v_5']], axis=1)
sns.regplot(x='v_5', y='price', data=v_5_scatter_plot, scatter=True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train, Train_data['v_2']], axis=1)
sns.regplot(x='v_2', y='price', data=v_2_scatter_plot, scatter=True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train, Train_data['v_6']], axis=1)
sns.regplot(x='v_6', y='price', data=v_6_scatter_plot, scatter=True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train, Train_data['v_1']], axis=1)
sns.regplot(x='v_1', y='price', data=v_1_scatter_plot, scatter=True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train, Train_data['v_14']], axis=1)
sns.regplot(x='v_14', y='price', data=v_14_scatter_plot, scatter=True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train, Train_data['v_13']], axis=1)
sns.regplot(x='v_13', y='price', data=v_13_scatter_plot, scatter=True, fit_reg=True, ax=ax10)
plt.show()

[Figure 15: regression plots of the selected features against price]
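The ten nearly identical blocks above can be collapsed into a loop; a sketch that produces the same grid of plots:

features = ["v_12", "v_8", "v_0", "power", "v_5", "v_2", "v_6", "v_1", "v_14", "v_13"]
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
for feat, ax in zip(features, axes.flatten()):
    data = pd.concat([Y_train, Train_data[feat]], axis=1)
    sns.regplot(x=feat, y="price", data=data, scatter=True, fit_reg=True, ax=ax)
plt.show()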

Categorical feature analysis

nunique distribution

## 1) nunique distribution
for fea in categorical_features:
    print(Train_data[fea].nunique())
99662
248
40
8
7
2
2
7905
print(categorical_features)
['name', 
'model', 
'brand', 
'bodyType', 
'fuelType', 
'gearbox', 
'notRepairedDamage', 
'regionCode']

Box plot visualization of the categorical features

## 2) Box plot visualization of the categorical features
# name and regionCode have far too many categories, so we only plot the denser ones
categorical_features = ["model",
                        "brand",
                        "bodyType",
                        "fuelType",
                        "gearbox",
                        "notRepairedDamage"]
for c in categorical_features:
    Train_data[c] = Train_data[c].astype("category")  # cast to the category dtype
    if Train_data[c].isnull().any():  # check for missing values
        Train_data[c] = Train_data[c].cat.add_categories(["MISSING"])  # register a new category
        Train_data[c] = Train_data[c].fillna("MISSING")  # fill NaN with it

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)  # box plot
    x = plt.xticks(rotation=90)  # rotate the x tick labels

f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)  # `size` was renamed `height` in newer seaborn
g = g.map(boxplot, "value", "price")
plt.show()

[Figure 16: box plots of price by category]

Violin plot visualization of the categorical features

print(Train_data.columns)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
## 3) Violin plot visualization of the categorical features
catg_list = categorical_features
target = "price"
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

[Figures 17-22: violin plots of price for each of the six categorical features]

Bar plot visualization of the categorical features

print(categorical_features)
['model', 
'brand', 
'bodyType', 
'fuelType', 
'gearbox', 
'notRepairedDamage']
## 4) Bar plot visualization of the categorical features
def bar_plot(x, y, **kwargs):  # bar plot
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")
plt.show()

[Figure 23: bar plots of mean price by category]

Visualize the frequency of each category

## 5) Visualize the frequency of each category
def count_plot(x, **kwargs):  # count plot
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")
plt.show()

[Figure 24: count plots of the categorical features]

Generate a data report with pandas_profiling

## Generate a data report
import pandas_profiling

pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
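Note that the package has since been renamed: in newer environments the equivalent call (a sketch, assuming ydata-profiling is installed) is:

from ydata_profiling import ProfileReport
ProfileReport(Train_data).to_file("./example.html")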

Complete code

# Import the warnings package and use a filter to suppress warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

## pd.set_option('display.max_columns', None)  # show all columns
## pd.set_option('display.max_rows', None)     # show all rows
## 1) Load the training and test sets
Train_data = pd.read_csv("./datalab/used_car_train_20200313.csv", sep = " ")
Test_data = pd.read_csv("./datalab/used_car_testA_20200313.csv", sep = " ")

## 2) Take a quick look at the data (head() + shape)
#print(pd.concat([Train_data.head(), Train_data.tail()]))  # DataFrame.append was removed in pandas 2.x
#print(Train_data.shape)
#
# ## 3) Get familiar with the summary statistics via describe()
# print(Train_data.describe())
#
# ## 4) Get familiar with the data types via info()
# print(Train_data.info())
#
# ## 5) Check for missing values and anomalies
# print(Train_data.isnull().sum())
#
# NaN visualization
# missing = Train_data.isnull().sum()
# missing = missing[missing > 0]
# missing.sort_values(inplace=True) # sort ascending
# missing.plot.bar() # bar chart
# plt.tight_layout() # auto-adjust subplot parameters
# plt.show()
# # # Visualize the missing values
# msno.matrix(Train_data.sample(250))
# # plt.show()
# msno.bar(Train_data.sample(1000)) # bar chart of non-null counts
# plt.show()

## 6) Inspect outliers and anomalous values
# Train_data.info()
## print(Train_data["notRepairedDamage"].value_counts()) # distinct values and their counts
Train_data["notRepairedDamage"].replace("-", np.nan, inplace=True) # replace '-' with NaN
# print(Train_data.isnull().sum())

#print(Train_data["notRepairedDamage"].value_counts())
#Test_data.info()
##print(Test_data["notRepairedDamage"].value_counts())
#Test_data["notRepairedDamage"].replace("-", np.nan, inplace=True)
##print(Test_data["notRepairedDamage"].value_counts())

# Drop the severely skewed columns
#print(Train_data["seller"].value_counts())
#print(Train_data["offerType"].value_counts())
# print(Test_data["seller"].value_counts())
# print(Test_data["offerType"].value_counts())

del Train_data["seller"]
del Train_data["offerType"]
# print(Train_data.info())
# print(Train_data.shape)
#del Test_data["seller"]
#del Test_data["offerType"]




# Understand the distribution of the target
# print(Train_data["price"])
# print(Train_data["price"].value_counts())

## 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st
# y = Train_data["price"]
# plt.figure(1); plt.title("Johnson SU") # new figure
# sns.distplot(y, kde=False, fit=st.johnsonsu)
# plt.figure(2); plt.title("Normal")
# sns.distplot(y, kde=False, fit=st.norm)
# plt.figure(3); plt.title("Log Normal")
# sns.distplot(y, kde=False, fit=st.lognorm)
# plt.show() # the best fit is the unbounded Johnson distribution

## 2) Check skewness and kurtosis
# sns.distplot(Train_data["price"])
# print("Skewness: %f" % Train_data["price"].skew()) # skewness
# print("Kurtosis: %f" % Train_data["price"].kurt()) # kurtosis
# plt.show()

# print(Train_data.skew())
# print(Train_data.kurt())
# sns.distplot(Train_data.skew(), color="blue", axlabel="Skewness")
# plt.show()
# sns.distplot(Train_data.kurt(), color="orange", axlabel="Kurtosis")
# plt.show()

# 3) Check the frequency counts of the target
# plt.hist(Train_data["price"], orientation="vertical", histtype="bar", color="red")
# plt.show() # histogram
# After the log transform the distribution is fairly even; predicting on the log scale is a common trick
# plt.hist(np.log(Train_data["price"]), orientation="vertical", histtype="bar", color="red")
# plt.show()



## Examine the features
# Separate out the label (the target to predict)
Y_train = Train_data["price"]
## This automatic split works for data without pre-encoded labels
## It does not apply here; the split must be made by hand from the actual meaning of each field
## Numeric features
## numeric_features = Train_data.select_dtypes(include=[np.number])
## numeric_features.columns
## # Categorical features
## categorical_features = Train_data.select_dtypes(include=[np.object])
## categorical_features.columns



# Numeric features
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
# Categorical features
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
## nunique distribution of the categorical features - Train_data
# for cat_fea in categorical_features:
#     print(cat_fea + " distribution:")
#     print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
#     print(Train_data[cat_fea].value_counts())
## nunique distribution of the categorical features - Test_data
# for cat_fea in categorical_features:
#     print(cat_fea + " distribution:")
#     print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
#     print(Test_data[cat_fea].value_counts())

## Numeric feature analysis
numeric_features.append("price")
# print(numeric_features)
#print(Train_data.head())
## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr() # correlation-coefficient matrix
# print(correlation["price"].sort_values(ascending=False), "\n") # sort descending

# f , ax = plt.subplots(figsize = (7, 7))
# plt.title("Correlation of Numeric Features with Price")
# sns.heatmap(correlation, square=True, vmax=0.8) # heatmap of the correlation matrix
# plt.show()

## 2) Check the skewness and kurtosis of several features
# for col in numeric_features:
#     print("{:15}".format(col), "Skewness:{:05.2f}".format(Train_data[col].skew()),
#           " ",
#           "Kurtosis:{:06.2f}".format(Train_data[col].kurt()))

## 3) Visualize the distribution of each numeric feature
# f = pd.melt(Train_data, value_vars=numeric_features) # unpivot to long format
# g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False) # one facet per variable
# g = g.map(sns.distplot, "value") # draw the "value" distribution in each facet
# plt.show()

## 4) Visualize pairwise relationships between the numeric features
# sns.set() # style setup
# columns = ["price", "v_12", "v_8", "v_0", "power", "v_5", "v_2", "v_6", "v_1", "v_14"]
# sns.pairplot(Train_data[columns], height=2, kind="scatter", diag_kind="kde") # pair plot
# plt.show()

# print(Train_data.columns)
# print(Y_train)


## 5) Visualize regression relationships between multiple variables
# fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20)) # a 5x2 grid of ten subplots
# # ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
# v_12_scatter_plot = pd.concat([Y_train, Train_data["v_12"]], axis=1) # combine target and feature side by side
# # print(v_12_scatter_plot)
# sns.regplot(x="v_12", y="price", data=v_12_scatter_plot, scatter=True, fit_reg=True, ax=ax1) # scatter plus fitted regression line
#
# v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
# sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
#
# v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
# sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
#
# power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
# sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
#
# v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
# sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
#
# v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
# sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
#
# v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
# sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)
#
# v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
# sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)
#
# v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
# sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)
#
# v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
# sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
# plt.show()

# Categorical feature analysis
## 1) nunique distribution
# for fea in categorical_features:
#     print(Train_data[fea].nunique())
#
# print(categorical_features)

## 2) Box plot visualization of the categorical features
# name and regionCode have far too many categories, so we only plot the denser ones
categorical_features = ["model",
"brand",
"bodyType",
"fuelType",
"gearbox",
"notRepairedDamage"]
for c in categorical_features:
    Train_data[c] = Train_data[c].astype("category") # cast to the category dtype
    if Train_data[c].isnull().any(): # check for missing values
        Train_data[c] = Train_data[c].cat.add_categories(["MISSING"]) # register a new category
        Train_data[c] = Train_data[c].fillna("MISSING") # fill NaN with it
# def boxplot(x, y, **kwargs):
#     sns.boxplot(x=x, y=y) # box plot
#     x = plt.xticks(rotation=90) # rotate the x tick labels
#
# f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
# g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
# g = g.map(boxplot, "value", "price")
# plt.show()

## 3) Violin plot visualization of the categorical features
#print(Train_data.columns)
# catg_list = categorical_features
# target = "price"
# for catg in catg_list:
#     sns.violinplot(x=catg, y=target, data=Train_data)
#     plt.show()

# print(categorical_features)

## 4) Bar plot visualization of the categorical features
# def bar_plot(x, y, **kwargs): # bar plot
#     sns.barplot(x=x, y=y)
#     x = plt.xticks(rotation=90)
# f = pd.melt(Train_data, id_vars=["price"], value_vars=categorical_features)
# g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
# g = g.map(bar_plot, "value", "price")
# plt.show()

## 5) Visualize the frequency of each category
# def count_plot(x, **kwargs): # count plot
#     sns.countplot(x=x)
#     x = plt.xticks(rotation=90)
# f = pd.melt(Train_data, value_vars=categorical_features)
# g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
# g = g.map(count_plot, "value")
# plt.show()

## Generate a data report
import pandas_profiling
#
# pfr = pandas_profiling.ProfileReport(Train_data)
# pfr.to_file("./example.html")

Lessons learned

The EDA steps given here are the widely used, standard ones. Whether in engineering work or in competitions, this is only the first step, but also the most fundamental one.

The next step is usually to combine model performance and feature engineering to analyze how the data actually behaves in modeling and, drawing on your own understanding and the literature, to form judgments and a deeper understanding of the real problem.

Finally, EDA, data processing, and mining are iterated to reach a better data structure and distribution, and features that are more strongly correlated with the target.

In machine learning, data exploration is usually called EDA (Exploratory Data Analysis):

it means exploring the available data (especially raw data from surveys or observation) under as few prior assumptions as possible, probing the structure and regularities of the data through plotting, tabulation, equation fitting, computing summary statistics, and similar means.

Data exploration helps us discover properties of the data and the relationships between variables, which is very helpful for the feature construction that follows.

  1. An initial analysis of the data (inspecting it directly, or using statistical functions such as .sum(), .mean(), .describe()) can cover: sample count, training-set size, whether there is a time feature and whether the problem is a time series, what the (non-anonymous) features mean, feature types (string, int, float, time), the missing-value situation (watch how missingness is represented: some entries are empty, others use symbols such as "NaN"), and the mean and variance of each feature.

  2. For samples where some feature is more than 30% missing, record and analyze how the gaps are handled; this helps later model validation and tuning. Decide whether to fill (and with what: mean, zero, mode, etc.), to drop the feature, or to first split the samples into groups and predict each group with a model using different features.

  3. Analyze outliers specifically: whether the label attached to an anomalous feature value is itself an outlier (far from the mean, or a special symbol), whether outliers should be removed or replaced with normal values, and whether the anomaly comes from the record or from the measuring device itself.

  4. Analyze the label specifically: its distribution and so on.

  5. Go further by plotting the features, and the features jointly with the label (summary plots, scatter plots), to get a direct feel for their distributions; this step can also surface outliers. Use box plots to analyze how far feature values deviate, and feature-feature and feature-label joint plots to analyze the relationships among them.