SelectKBest
```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Columns 1-93 hold the features; column 94 is the target.
data = pd.read_csv('covid.train.csv')
x = data[data.columns[1:94]]
y = data[data.columns[94]]

# Min-max scale every feature to the range [0, 1].
x = (x - x.min()) / (x.max() - x.min())

# Score each feature against the target with the univariate F-test.
bestfeatures = SelectKBest(score_func=f_regression, k=5)
fit = bestfeatures.fit(x, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x.columns)

# Pair each column name with its score and print the 20 highest-scoring features.
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(20, 'Score'))
```
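The code above shows how to use scikit-learn's SelectKBest.
The documentation describes it as: "Select features according to the k highest scores." It takes two parameters: score_func and k.
score_func is a function that assigns a score to each feature; the k features with the highest scores are then kept.
In short, SelectKBest is used to pick the features most strongly associated with the target.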
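To see the selection itself rather than only the scores, here is a minimal sketch on synthetic data. The column names `f0` to `f5` and the way the target is built are assumptions made up for illustration; they have nothing to do with the covid dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): six features, but the target only depends on f0 and f3.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y = 3.0 * X["f0"] - 2.0 * X["f3"] + rng.normal(scale=0.1, size=200)

# Keep the two features with the highest F-test scores.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
selected = list(X.columns[selector.get_support()])
print(selected)
print(X_selected.shape)
```

`get_support()` is handy when you want the names of the kept columns, while `fit_transform` gives you the reduced feature matrix directly.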
train_test_split
```python
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows for testing; random_state=2 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.1, random_state=2)
```
This function splits a dataset into a training set and a test set.
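A small self-contained sketch of what the split does, using a toy array of 100 rows (an assumption for illustration, not the covid data): with `test_size=0.1`, 10 rows go to the test set, and fixing `random_state` gives the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 2 features, and y equal to the row index.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=2
)
print(X_train.shape, X_test.shape)

# No sample ends up in both sets.
overlap = set(y_train.tolist()) & set(y_test.tolist())
print(overlap)

# The same random_state reproduces the same split.
_, _, _, y_test_again = train_test_split(X, y, test_size=0.1, random_state=2)
same_split = bool((y_test == y_test_again).all())
print(same_split)
```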
KFold, StratifiedKFold, GroupKFold
These classes are commonly used for K-fold cross-validation.
KFold
```python
>>> from sklearn.model_selection import KFold

>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
```
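KFold simply splits the data into K folds. Note that by default it does not shuffle: the folds are consecutive blocks of samples, as the output above shows. Pass shuffle=True (optionally with random_state) to get a random split.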
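The difference between the default and the shuffled behaviour can be sketched as follows. The ten-element array is just toy data chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # ten toy samples (hypothetical data)

# Default behaviour: no shuffling, so the test folds are consecutive blocks.
ordered = [test.tolist() for _, test in KFold(n_splits=5).split(X)]
print(ordered)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True randomises which samples land in which fold;
# random_state makes that assignment reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = [test.tolist() for _, test in kf.split(X)]
print(shuffled)
```

Either way, every sample appears in exactly one test fold; shuffling only changes which one.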
StratifiedKFold
```python
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold

>>> X = np.ones(10)
>>> y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
```
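StratifiedKFold splits the data according to the target, so that the class proportions in each fold are approximately the same as in the original dataset.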
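One way to check the stratification is to count the classes in each test fold. This is a minimal sketch on an imbalanced toy target (45 negatives and 5 positives, an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy target (hypothetical): 45 samples of class 0, 5 of class 1.
X = np.ones((50, 1))
y = np.array([0] * 45 + [1] * 5)

test_counts = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    # Count how many samples of each class landed in this test fold.
    counts = np.bincount(y[test_idx], minlength=2)
    test_counts.append(counts.tolist())
    print("test fold class counts:", counts)
```

Each test fold should contain roughly a tenth of positives, like the full dataset, so the rare class is never missing from a fold.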
GroupKFold
```python
>>> from sklearn.model_selection import GroupKFold

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
```
GroupKFold guarantees that samples from the same group never appear in both the training set and the test set of a split.
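That guarantee can be checked directly by intersecting the group labels on each side of every split. The sketch below reuses the same group layout as the example, with toy features and labels that are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical); the groups array mirrors the example above.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    # Groups present on both sides of the split; expected to be empty.
    shared = train_groups & test_groups
    overlaps.append(shared)
    print("test groups:", sorted(test_groups), "shared:", shared)
```

This matters when samples within a group are correlated (for example, several measurements from the same patient), since leaking a group into both sets would make the validation score too optimistic.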
To be continued (this post is still being updated).