Feature selection and data preprocessing with sklearn

This post is a short tour of the code sklearn offers for feature selection and data cleaning.

Feature selection

We can select features in the following ways.

Removing low-variance features

from sklearn.feature_selection import VarianceThreshold
X = [[0,0,1],[0,1,0],[1,0,0],[0,1,1],[0,1,0],[0,1,1]]
sel = VarianceThreshold(threshold=(.8 * (1-.8)))
sel.fit_transform(X)

The returned array no longer contains the column whose variance fails to exceed the threshold.

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

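A threshold of 1.5 to 2.0 is the usual choice here for filtering low-variance features. Keep in mind that variance depends on each feature's scale: a Boolean feature has variance p(1 - p), which is at most 0.25, and that is why the example above uses 0.8 * (1 - 0.8) = 0.16 to drop columns that hold the same value in more than 80% of the samples.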
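To make the thresholding concrete, here is a minimal sketch that recomputes the per-column variances of the example matrix and checks which columns survive. The comparison against np.var and the variable names are my own illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same Boolean matrix as in the example above
X_bool = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                   [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)  # 0.16

sel = VarianceThreshold(threshold=threshold).fit(X_bool)

# Population variance of each column, computed by hand for comparison
column_variances = np.var(X_bool, axis=0)

# Boolean mask of the columns that are kept (variance above the threshold)
kept = sel.get_support()

print(column_variances)
print(kept)
```

The first column is 1 in only one of six rows, so its variance (1/6 * 5/6, about 0.139) falls below 0.16 and it is dropped.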

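Univariate feature selection

SelectKBest keeps the K highest-scoring features.
SelectPercentile keeps the features whose scores fall in a user-specified top percentage.
SelectFpr, SelectFdr and SelectFwe apply a common univariate statistical test to each feature, based on the false positive rate, the false discovery rate, and the family-wise error respectively.
GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator.
These objects take a scoring function as input; it returns univariate scores and p-values (or only scores in the case of SelectKBest and SelectPercentile):
For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
There are examples of this in the earlier post on data EDA.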
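As a quick illustration of the selectors listed above, the following sketch runs SelectKBest with chi2 on the iris data. The choice of dataset, of chi2 as the scoring function, and of k=2 are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X_iris, y_iris = load_iris(return_X_y=True)

# Keep the two features with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=2)
X_best = selector.fit_transform(X_iris, y_iris)

scores = selector.scores_      # one score per original feature
pvalues = selector.pvalues_    # matching p-values
selected = selector.get_support(indices=True)  # column indices that were kept

print(X_best.shape)
print(scores, selected)
```

Swapping chi2 for f_classif or mutual_info_classif only requires changing the score_func argument. Note that chi2 expects non-negative feature values.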

Tree-based feature selection

We use the iris dataset as an example.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X,y = iris.data, iris.target
X.shape

At this point X is an array of shape (150, 4).

clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X,y)
clf.feature_importances_
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape

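After feature selection, X_new is an array of shape (150, 2). Because ExtraTreesClassifier is randomized and no random_state is set, the number of kept features can differ between runs.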
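To see why some columns are dropped, the sketch below inspects the importances and the selection mask. For a tree ensemble like this one, SelectFromModel's default threshold is the mean importance. The fixed random_state is my own addition for repeatability, not something the original code sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

iris_data = load_iris()
X_iris, y_iris = iris_data.data, iris_data.target

# random_state=0 is an assumption, used only to make the run repeatable
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_iris, y_iris)
selector = SelectFromModel(forest, prefit=True)

importances = forest.feature_importances_
mask = selector.get_support()

# Default rule for this estimator: keep features with importance >= mean importance
manual_mask = importances >= importances.mean()

kept_names = [name for name, keep in zip(iris_data.feature_names, mask) if keep]
print(importances)
print(kept_names)
```

Passing an explicit threshold (for example threshold="median") to SelectFromModel changes how many features are kept.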

Data preprocessing with the sklearn.preprocessing module

Standardization (scale)

Using the sklearn.preprocessing.scale() function:

from sklearn import preprocessing
X_scale = preprocessing.scale(X)

Using the sklearn.preprocessing.StandardScaler class:

scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)
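One practical difference between the two approaches is that a StandardScaler object remembers the training mean and standard deviation, so the same transformation can be reapplied to new data. The sketch below shows this with small made-up arrays; the data and the train/test naming are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example
X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])
X_test = np.array([[-1.0, 1.0, 0.0]])

# Learn mean and standard deviation from the training data only
std_scaler = StandardScaler().fit(X_train)

X_train_std = std_scaler.transform(X_train)
# Reuse the training statistics on unseen data
X_test_std = std_scaler.transform(X_test)

print(std_scaler.mean_, std_scaler.scale_)
print(X_test_std)
```

After scaling, each training column has zero mean and unit variance, while the test row is simply shifted and scaled by the training statistics.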

Scaling features to a given range

Scaling to the range 0 to 1:

min_max_scaler = preprocessing.MinMaxScaler()
X_minMax = min_max_scaler.fit_transform(X)

Scaling to the range -1 to 1 (each feature is divided by its maximum absolute value):

max_abs_scaler = preprocessing.MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X)
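To compare the two range scalers side by side, here is a small sketch on made-up data that checks each result against the formula it is expected to follow. The array and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Toy data, invented for this example
data = np.array([[1.0, -1.0, 2.0],
                 [2.0, 0.0, 0.0],
                 [0.0, 1.0, -1.0]])

# (x - min) / (max - min), column by column
minmax = MinMaxScaler().fit_transform(data)

# x / max(|x|), column by column; signs and zeros are preserved
maxabs = MaxAbsScaler().fit_transform(data)

print(minmax)
print(maxabs)
```

MaxAbsScaler does not shift the data, so a column only spans the full -1 to 1 interval when it contains both extremes; this also makes it usable with sparse input.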

Encoding categorical features (one-hot encoding)

preprocessing.OrdinalEncoder()
preprocessing.OneHotEncoder()
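The two encoders behave quite differently, so a short sketch may help. The sample categories below are invented, and handle_unknown="ignore" is an optional setting I chose for the example, not something the original specifies.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical data, invented for this example
samples = np.array([["male", "US"],
                    ["female", "Europe"],
                    ["female", "US"]])

# OrdinalEncoder: one integer code per category, one output column per input column
ordinal = OrdinalEncoder().fit_transform(samples)

# OneHotEncoder: one binary column per category; the default output is sparse
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(samples).toarray()

print(ordinal)
print(onehot_encoder.categories_)
print(onehot)
```

OrdinalEncoder implies an order between categories, which some models will take literally; OneHotEncoder avoids that at the cost of more columns.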

Generating polynomial features

Full polynomial expansion with all cross terms: the features of X go from (X1, X2) to (1, X1, X2, X1^2, X1X2, X2^2).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3,2)
poly = PolynomialFeatures(2) # expand to degree 2
poly.fit_transform(X)
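To check the column order described above, the following sketch builds the six degree-2 columns by hand and compares them with the transformer's output. The manual construction and the variable names are my own illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.arange(6).reshape(3, 2)   # rows: [0, 1], [2, 3], [4, 5]
x1, x2 = X_demo[:, 0], X_demo[:, 1]

# Columns in the documented order: 1, X1, X2, X1^2, X1*X2, X2^2
manual = np.column_stack([np.ones(3), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

expanded = PolynomialFeatures(2).fit_transform(X_demo)

print(expanded)
```

For the middle row (2, 3) this gives 1, 2, 3, 4, 6, 9.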

Interaction terms only (products of distinct features): the features of X go from (X1, X2, X3) to (1, X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3).

X = np.arange(9).reshape(3,3)
poly = PolynomialFeatures(degree=3,interaction_only=True) # degree 3, products of distinct features only
poly.fit_transform(X)
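To show what interaction_only leaves out, this sketch compares the number of output columns with and without it for three input features, and rebuilds the eight interaction columns by hand. The comparison is my own addition for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_three = np.arange(9).reshape(3, 3)  # rows: [0, 1, 2], [3, 4, 5], [6, 7, 8]
a, b, c = X_three[:, 0], X_three[:, 1], X_three[:, 2]

inter = PolynomialFeatures(degree=3, interaction_only=True)
inter_out = inter.fit_transform(X_three)

full = PolynomialFeatures(degree=3).fit(X_three)

# 1, X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3 (no squares or cubes)
manual = np.column_stack([np.ones(3), a, b, c, a * b, a * c, b * c, a * b * c])

print(inter.n_output_features_, full.n_output_features_)
print(inter_out)
```

With three features the interaction-only expansion has 8 columns, against 20 for the full degree-3 expansion, because powers of a single feature such as X1^2 are excluded.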