How to Perform Feature Selection for Regression Data in Machine Learning (Part 5)

from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# define dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# define the evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the pipeline to evaluate
model = LinearRegression()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('lr', model)])
# define the grid: k from 80 to 100 inclusive
grid = dict()
grid['sel__k'] = [i for i in range(X.shape[1]-20, X.shape[1]+1)]
# define the grid search (scored with negated MAE so the printed label matches the metric)
search = GridSearchCV(pipeline, grid, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)
# perform the search
results = search.fit(X, y)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
	print(">%.3f with: %r" % (mean, param))

Running the example grid searches different numbers of selected features using mutual information statistics, where each modeling pipeline is evaluated using repeated cross-validation.
In this case, we can see that the best number of selected features is 81, which achieves an MAE of about 0.082 (ignoring the sign).
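The sign appears because scikit-learn scorers are always maximized, so error metrics such as `neg_mean_absolute_error` are reported in negated form. A minimal sketch of recovering the positive MAE follows. The small dataset, the plain `KFold` split, and the variable names are assumptions chosen only to keep the example fast; they are not the configuration used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# small illustrative dataset (an assumption, much smaller than the tutorial's)
X, y = make_regression(n_samples=200, n_features=10, n_informative=5, noise=0.1, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)
# scikit-learn returns the negated MAE so that larger is always better
neg_scores = cross_val_score(LinearRegression(), X, y, scoring='neg_mean_absolute_error', cv=cv)
# flip the sign to report the error in its usual positive form
mae = -neg_scores.mean()
print('MAE: %.3f' % mae)
```

The same negation applies to `best_score_` from a grid search scored with a `neg_*` metric.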

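Given the stochastic nature of the learning algorithm and the evaluation procedure, your specific results may vary. Try running the example a few times.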
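Part of that variation comes from `mutual_info_regression` itself, which adds a small amount of random noise to continuous features and accepts a `random_state` argument. One way to pin it down, sketched here as an assumption rather than something the original example does, is to wrap the score function with `functools.partial` before passing it to `SelectKBest`. The dataset size, `k=5`, and the helper name `evaluate` are illustrative only.

```python
from functools import partial
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# small illustrative dataset (an assumption, not the tutorial's configuration)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=0.1, random_state=1)

def evaluate(X, y, k):
	# fix the seed used inside the mutual information estimator
	score_func = partial(mutual_info_regression, random_state=1)
	pipeline = Pipeline(steps=[('sel', SelectKBest(score_func=score_func, k=k)), ('lr', LinearRegression())])
	# the cross-validation splits are seeded as well
	cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
	return cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv)

# two runs with identical seeds should produce identical fold scores
first = evaluate(X, y, k=5)
second = evaluate(X, y, k=5)
print((first == second).all())
```

Seeding makes a run repeatable; it does not remove the underlying variance, so comparing several seeds is still worthwhile.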
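We might want to see the relationship between the number of selected features and MAE. In this relationship, we might expect more features to yield better performance.
This can be explored by manually evaluating each configuration of k for SelectKBest from 81 to 100, gathering the sample of MAE scores, and plotting the results side by side using box and whisker plots. The spread and mean of these box plots should reveal any interesting relationship between the number of selected features and the MAE of the pipeline.
Note that we start the range of k values at 81 instead of 80 because the spread of MAE scores for k=80 is far larger than for every other value of k considered, which would otherwise dominate the scale of the plot.
The complete example of achieving this is listed below.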

from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# define number of features to evaluate: k from 81 to 100 inclusive
num_features = [i for i in range(X.shape[1]-19, X.shape[1]+1)]
# enumerate each number of features
results = list()
for k in num_features:
	# create pipeline
	model = LinearRegression()
	fs = SelectKBest(score_func=mutual_info_regression, k=k)
	pipeline = Pipeline(steps=[('sel',fs), ('lr', model)])
	# evaluate the model
	cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
	results.append(scores)
	# summarize the results
	print('>%d %.3f (%.3f)' % (k, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=num_features, showmeans=True)
pyplot.show()

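Running the example reports the mean and standard deviation of MAE for each number of selected features.
In this case, the reported mean and standard deviation of MAE are not very interesting, other than that values of k in the 80s perform better than those in the 90s.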
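When the printed table is hard to scan, the collected scores can be ranked by their mean. The following is a minimal sketch that reuses the `num_features` and `results` names from the listing above; the helper `rank_by_mean` is hypothetical, and the score values are placeholders invented for illustration, not output from the tutorial.

```python
from statistics import mean

def rank_by_mean(num_features, results):
	# pair each k with its mean score; scores are negated MAE, so higher is better
	pairs = [(k, mean(scores)) for k, scores in zip(num_features, results)]
	# sort best-first
	return sorted(pairs, key=lambda p: p[1], reverse=True)

# placeholder scores for illustration only, not real results
num_features = [81, 82, 83]
results = [[-0.10, -0.12], [-0.08, -0.09], [-0.13, -0.11]]
ranking = rank_by_mean(num_features, results)
# the first entry holds the k with the highest mean score
print(ranking[0][0])
```

Ranking by mean alone ignores the spread of the scores, so it complements the box plot rather than replacing it.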