SelectKBest

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

data = pd.read_csv('covid.train.csv')
x = data[data.columns[1:94]]  # feature columns
y = data[data.columns[94]]    # target column

# min-max scale every feature to the range [0, 1]
x = (x - x.min()) / (x.max() - x.min())

bestfeatures = SelectKBest(score_func=f_regression, k=5)
fit = bestfeatures.fit(x, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x.columns)
# concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']  # naming the dataframe columns
print(featureScores.nlargest(20, 'Score'))  # print the 20 best features

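The code above demonstrates scikit-learn's SelectKBest.

According to the documentation, SelectKBest will "Select features according to the k highest scores." It takes two parameters: score_func and k.

score_func is a function that scores each feature against the target; the k features with the highest scores are then kept.

SelectKBest is used to pick the features most strongly associated with the target.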
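To see which columns SelectKBest actually keeps, here is a minimal sketch on synthetic data. The arrays X_demo and y_demo are made up for illustration and are not from the covid dataset; only features 0 and 3 are constructed to influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): 200 samples, 6 features,
# where only features 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

# get_support(indices=True) returns the indices of the columns that were kept.
selected_idx = selector.get_support(indices=True)
print(selected_idx)
print(X_selected.shape)
```

fit_transform returns only the k selected columns, so the result can be passed straight to a model.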

train_test_split

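from sklearn.model_selection import train_test_split

# Split into training and test sets; random_state is the random seed.
# Pass the feature matrix x, not data, which still contains the target column.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=2)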

This function splits a dataset into a training set and a test set.
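A small self-contained sketch of what the split produces, assuming a toy array in place of the real data (X_demo and y_demo are hypothetical). It checks the resulting sizes and that each feature row stays paired with its label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption): row i is [2*i, 2*i + 1] and its label is i.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=2
)

# With 100 samples and test_size=0.1, 10 rows go to the test set.
print(X_train.shape, X_test.shape)

# Rows and labels are shuffled together, so the pairing is preserved.
rows_aligned = bool((X_train[:, 0] // 2 == y_train).all())
print(rows_aligned)
```

Reusing the same random_state gives the same split on every run, which makes experiments reproducible.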

KFold, StratifiedKFold, GroupKFold

These classes are commonly used for K-fold cross-validation.
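As a rough sketch of how such a splitter is typically plugged into cross-validation, the example below passes a KFold object to cross_val_score. The synthetic data and the choice of LinearRegression are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Any splitter (KFold, StratifiedKFold, GroupKFold) can be passed as cv.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print(scores)         # one R^2 score per fold
print(scores.mean())  # average score across folds
```

For a regressor, cross_val_score uses the estimator's default score, which is R^2.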

KFold

>>> import numpy as np
>>> from sklearn.model_selection import KFold

>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]

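KFold simply splits the data into K consecutive folds. By default it does not shuffle, so pass shuffle=True if a random split is needed.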
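Note that KFold only randomizes the folds when shuffle=True is passed. The sketch below, on a hypothetical 10-element array, compares the default behaviour with a shuffled split and confirms that every sample lands in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10)  # toy data, an assumption for illustration

# Default KFold: no shuffling, so the first test fold is the first block of indices.
default_first_test = next(KFold(n_splits=5).split(X_demo))[1]
print(default_first_test)

# Shuffled KFold: random_state makes the shuffle reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(len(X_demo), dtype=int)
for train_idx, test_idx in kf.split(X_demo):
    test_counts[test_idx] += 1
    print(train_idx, test_idx)

# Each sample is used for testing exactly once across the 5 folds.
print(test_counts)
```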

StratifiedKFold

>>> from sklearn.model_selection import StratifiedKFold

>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]

StratifiedKFold splits the data so that the class proportions of the target in each fold are approximately the same as in the original dataset.
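The preserved proportions can be checked directly. This is a minimal sketch with made-up labels (30 samples of class 0 and 60 of class 1, so two thirds of the data is class 1); it prints the share of class 1 in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption): 30 zeros followed by 60 ones.
y_demo = np.array([0] * 30 + [1] * 60)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split

skf = StratifiedKFold(n_splits=3)
fractions = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Fraction of class 1 inside this test fold.
    fractions.append(y_demo[test_idx].mean())

print(y_demo.mean())  # share of class 1 in the full dataset
print(fractions)      # share of class 1 in each test fold
```

With a plain KFold on these ordered labels, the first fold would contain only class 0, which is why stratification matters for classification targets.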

GroupKFold

>>> from sklearn.model_selection import GroupKFold

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

GroupKFold guarantees that samples from the same group never appear in both the training set and the test set.
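The no-overlap guarantee can be verified with a short sketch. The group labels below are hypothetical (6 groups of 5 samples each, think of one group per patient or per region); for each fold, the code collects the groups shared between training and test indices.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup (an assumption): 30 samples belonging to 6 groups of 5.
groups_demo = np.repeat(np.arange(6), 5)
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.zeros(30)

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in gkf.split(X_demo, y_demo, groups=groups_demo):
    train_groups = set(groups_demo[train_idx])
    test_groups = set(groups_demo[test_idx])
    # Groups present on both sides of the split; expected to be empty.
    overlaps.append(train_groups & test_groups)
    print(sorted(test_groups))

print(overlaps)
```

This is useful when samples within a group are correlated, since letting them leak across the split would make validation scores look better than they are.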

To be updated.