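39 · scikit-learn in Practice: Preprocessing, Feature Selection, and Cross-Validation
🔗 Knowledge-graph navigation: before reading this article, it helps to be comfortable with data cleaning from "32 · Pandas in Practice" and matrix operations from "35 · Linear Algebra in Practice". scikit-learn is built on NumPy array operations, and its inputs and outputs are mostly ndarrays.
Environment:
pip install scikit-learn numpy
Geek's take: scikit-learn's core design is the estimator interface. Every estimator has fit(X, y); transformers add transform(X), and predictors add predict(X).
Pipeline chains several estimators together. When the pipeline is fitted on training data (including each training fold during cross-validation), preprocessing is fitted on that data only, which prevents data leakage.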
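Core scikit-learn interfaces
| Interface | What it does |
|---|---|
| fit(X, y) | Learns parameters from the training set |
| transform(X) | Transforms data with the learned parameters |
| fit_transform(X, y) | fit and transform in one call (training set only) |
| predict(X) | Predicts labels for new data |
| score(X, y) | Evaluates model performance |
| Pipeline | Chains several estimators and prevents data leakage |
| cross_val_score | k-fold cross-validation to estimate generalization |
| GridSearchCV | Grid search for the best hyperparameters |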
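As a minimal sketch of the fit / transform split described above, the snippet below fits a StandardScaler on a training split and reuses the learned parameters on the test split. The toy data (random values around 50) and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: 200 rows, 3 columns, centered near 50
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training split only
X_test_scaled = scaler.transform(X_test)        # reuse those parameters, no refit

print("learned means:", np.round(scaler.mean_, 2))
print("train means after scaling:", np.round(X_train_scaled.mean(axis=0), 6))
print("test means after scaling:", np.round(X_test_scaled.mean(axis=0), 3))
```

The training columns end up with a mean of essentially zero, while the test columns are only close to zero, because the test split was scaled with statistics it did not contribute to.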
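Step by step: an adaptive breakdown of the core logic
This article splits the sklearn machine-learning pipeline into 7 steps: generating data, printing result tables, preprocessing, feature selection, cross-validation, grid search, and finally dispatching every mode from a CLI. Each demo comes with mock data and print() feedback.
Step 1: Generate an offline classification dataset with make_data
Pain point and mechanism:
make_data is the data factory of this article. It uses sklearn's built-in make_classification to generate a classification dataset, so nothing needs to be downloaded. X is the feature table, like a student's grades in several subjects; y is the label, like whether the student finally passed. All later preprocessing, selection, and training steps work on these two arrays.
Core source (verbatim from the full source at the end of this article):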
def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=10, n_redundant=5, random_state=42
    )
    return X, y
Runnable demo (mock data and print feedback added):
from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=10, n_redundant=5, random_state=42
    )
    return X, y

X, y = make_data(n_samples=120, n_features=20)
print("📦 Classification data generated")
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Class distribution:", dict(zip(*np.unique(y, return_counts=True))))
print("First row, first 5 features:", np.round(X[0, :5], 3).tolist())
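Step 2: Lay out experiment results as a terminal table with print_table
Pain point and mechanism:
print_table is a small report tool for the terminal. A machine-learning workflow produces many results to compare, and unless they are laid out in a table, beginners struggle to tell which step did better. It is like pasting lab notes into a grid so that every metric has a fixed position.
Core source (verbatim from the full source at the end of this article):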
def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*60}")
        print(f" {title}")
        print(f"{'='*60}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    sep = "┼".join("─" * (w + 2) for w in col_widths)
    header_line = "│".join(f" {str(h):<{w}} " for h, w in zip(headers, col_widths))
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{header_line}│")
    print(f"├{sep}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")
Runnable demo (a simplified plain-text variant of print_table, with mock data and print feedback):
def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*60}")
        print(f" {title}")
        print(f"{'='*60}")
    # Every column is at least 12 characters wide, then grows to fit its longest cell
    col_widths = [max(len(str(h)), 12) for h in headers]
    for row in rows:
        for i, cell in enumerate(row):
            col_widths[i] = max(col_widths[i], len(str(cell)))
    fmt = " " + " ".join(f"{{:<{w}}}" for w in col_widths)
    print(fmt.format(*headers))
    print(" " + " ".join("-" * w for w in col_widths))
    for row in rows:
        print(fmt.format(*row))

print_table(
    ["Step", "Purpose", "Beginner's view"],
    [["Scaler", "Unify scales", "Put height and weight on the same ruler"], ["Model", "Train a classifier", "Teach the machine to tell classes apart"]],
    title="Pipeline components",
)
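One caveat about len()-based column widths: len() counts characters, while most terminals draw CJK characters two columns wide, so tables that contain Chinese cells can look misaligned. A possible workaround, sketched here with the hypothetical helpers display_width and pad (neither is part of the article's script), is to measure width with the standard-library unicodedata module.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate terminal columns: wide/fullwidth characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text)

def pad(text: str, width: int) -> str:
    """Left-align text in a field that is `width` terminal columns wide."""
    return text + " " * max(0, width - display_width(text))

for cell in ["Scaler", "样本0", "归一化"]:
    print(f"|{pad(cell, 8)}| len={len(cell)} display_width={display_width(cell)}")
```

This is only an approximation; actual rendering still depends on the terminal and font.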
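Step 3: Put features on a common scale with StandardScaler/MinMaxScaler
Pain point and mechanism:
Preprocessing puts features on a common scale before training. StandardScaler rescales each feature to a mean of about 0 and a standard deviation of about 1; MinMaxScaler squeezes each feature into the range 0 to 1. It is like converting centimeters, meters, and kilometers to one ruler. Without it, scale-sensitive models such as regularized linear models or distance-based methods can be dominated by features with large numeric values.
Core source (verbatim from the full source at the end of this article):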
def mode_preprocess(X: np.ndarray) -> None:
    """演示标准化与归一化"""
    print(f"\n[{nexdo_time()}] 数据预处理演示")
    sample = X[:5, :4]
    std = StandardScaler().fit_transform(sample)
    mm = MinMaxScaler().fit_transform(sample)
    rows = []
    for i in range(5):
        rows.append([
            f"样本{i}",
            f"{sample[i,0]:.3f}",
            f"{std[i,0]:.3f}",
            f"{mm[i,0]:.3f}",
        ])
    print_table(["样本", "原始值(f0)", "标准化", "归一化"], rows, "预处理效果对比(特征0)")
Runnable demo (mock data and print feedback added):
from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y

def mode_preprocess(X: np.ndarray) -> None:
    std = StandardScaler().fit_transform(X)
    mm = MinMaxScaler().fit_transform(X)
    # Each row: label, mean, standard deviation, min~max of column 0
    rows = [
        ["Raw", f"{X[:,0].mean():.3f}", f"{X[:,0].std():.3f}", f"{X[:,0].min():.3f}~{X[:,0].max():.3f}"],
        ["Standard", f"{std[:,0].mean():.3f}", f"{std[:,0].std():.3f}", f"{std[:,0].min():.3f}~{std[:,0].max():.3f}"],
        ["MinMax", f"{mm[:,0].mean():.3f}", f"{mm[:,0].std():.3f}", f"{mm[:,0].min():.3f}~{mm[:,0].max():.3f}"],
    ]
    print("Preprocessing comparison (column 0 only):")
    for row in rows:
        print(" ", row)

X, y = make_data(n_samples=120, n_features=20)
mode_preprocess(X)
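To make the two scalers less of a black box, the sketch below reproduces their per-column formulas by hand with NumPy and compares the results with sklearn. The 4x2 array is made-up illustration data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two columns on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 1000.0]])

# StandardScaler: z = (x - column mean) / column std, using the population std (ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
# MinMaxScaler: (x - column min) / (column max - column min)
mm_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

z_sklearn = StandardScaler().fit_transform(X)
mm_sklearn = MinMaxScaler().fit_transform(X)

print("StandardScaler matches manual formula:", np.allclose(z_manual, z_sklearn))
print("MinMaxScaler matches manual formula:", np.allclose(mm_manual, mm_sklearn))
```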
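Step 4: Pick the more useful features with SelectKBest/RFE
Pain point and mechanism:
Feature selection asks: which columns are the most useful? SelectKBest is like ranking by single-subject score: it scores each column independently with a statistical test and keeps the top k. RFE is like a knockout tournament: it trains a model, drops the features that contribute least, and repeats until the requested number remains. This can cut noise and make the model easier to interpret.
Core source (verbatim from the full source at the end of this article):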
def mode_select(X: np.ndarray, y: np.ndarray) -> None:
    """演示特征选择"""
    print(f"\n[{nexdo_time()}] 特征选择演示")
    selector = SelectKBest(f_classif, k=5)
    selector.fit(X, y)
    scores = selector.scores_
    top5_idx = np.argsort(scores)[::-1][:5]
    rows = [(f"特征{i}", f"{scores[i]:.2f}", "✓" if i in top5_idx else "") for i in range(10)]
    print_table(["特征", "F分数", "入选Top5"], rows, "SelectKBest 特征评分(前10)")
    rfe = RFE(LogisticRegression(max_iter=500, random_state=42), n_features_to_select=5)
    rfe.fit(X, y)
    rows2 = [(f"特征{i}", rfe.ranking_[i], "✓" if rfe.support_[i] else "") for i in range(10)]
    print_table(["特征", "RFE排名", "入选"], rows2, "RFE 递归特征消除(前10)")
Runnable demo (mock data and print feedback added):
from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y

def mode_select(X: np.ndarray, y: np.ndarray) -> None:
    skb = SelectKBest(f_classif, k=8).fit(X, y)
    scores = skb.scores_
    # Indices of the 8 highest F-scores, best first; only the top 5 are printed
    top_idx = np.argsort(scores)[-8:][::-1]
    print("SelectKBest top features:")
    for idx in top_idx[:5]:
        print(f" feature_{idx:<2} score={scores[idx]:.2f}")
    model = LogisticRegression(max_iter=1000)
    rfe = RFE(model, n_features_to_select=8).fit(X, y)
    print("RFE selected features:", np.where(rfe.support_)[0].tolist())

X, y = make_data(n_samples=160, n_features=16)
mode_select(X, y)
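The demos above only print scores. As a small sketch of what the selector does to the data itself, the snippet below checks that the columns kept by SelectKBest are the k columns with the highest scores and that transform simply returns those columns. The dataset size and k=4 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)

selector = SelectKBest(f_classif, k=4).fit(X, y)
kept = selector.get_support(indices=True)                 # indices of the kept columns
top_by_score = np.sort(np.argsort(selector.scores_)[-4:])  # 4 highest F-scores, sorted by index
X_reduced = selector.transform(X)                          # same rows, only the kept columns

print("kept columns:", kept.tolist())
print("top-4 by F-score:", top_by_score.tolist())
print("shape before/after:", X.shape, X_reduced.shape)
```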
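Step 5: Prevent data leakage with Pipeline + cross_val_score
Pain point and mechanism:
Pipeline + cross_val_score is the key to preventing data leakage. The scaler must learn its mean and variance from the training folds only and must not peek at the validation fold. Pipeline works like a sealed assembly line: in every fold, preprocessing and training are refitted on the training part in the right order, and the validation part is only transformed and scored.
Core source (verbatim from the full source at the end of this article):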
def mode_cv(X: np.ndarray, y: np.ndarray) -> None:
    """演示交叉验证"""
    print(f"\n[{nexdo_time()}] 交叉验证演示")
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    rows = [(f"Fold {i+1}", f"{s:.4f}") for i, s in enumerate(scores)]
    rows.append(["均值", f"{scores.mean():.4f}"])
    rows.append(["标准差", f"{scores.std():.4f}"])
    print_table(["折次", "准确率"], rows, "5折交叉验证结果")
Runnable demo (mock data and print feedback added):
from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y

def mode_cv(X: np.ndarray, y: np.ndarray) -> None:
    # The scaler lives inside the pipeline, so it is refitted on each training fold
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print("5-fold cross-validation accuracy:", np.round(scores, 4).tolist())
    print(f"mean={scores.mean():.4f}, std={scores.std():.4f}")

X, y = make_data(n_samples=180, n_features=20)
mode_cv(X, y)
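Leakage is easier to believe once you see it. The sketch below is an illustration under stated assumptions, not part of the article's script: the features are pure random noise and the labels are unrelated to them, so an honest evaluation should land near chance. Selecting features on the full dataset before cross-validation lets the validation folds influence the selection; putting the selector inside the Pipeline does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5000))   # pure noise: no real signal in any column
y = np.array([0, 1] * 50)          # balanced labels unrelated to X

# Leaky: the selector sees every row, including rows later used for validation
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection is refitted inside each training fold
pipe = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection outside CV (leaky): {leaky:.3f}")
print(f"selection inside Pipeline:   {honest:.3f}")
```

On noise data the leaky score is typically well above chance while the Pipeline score stays close to 0.5, which is the inflated evaluation that the Pipeline is meant to avoid.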
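Step 6: Search for the best hyperparameters with GridSearchCV
Pain point and mechanism:
GridSearchCV is a fitting room for hyperparameters. For example, LogisticRegression's C is the inverse of the regularization strength (a smaller C means stronger regularization). The right value cannot be guessed; each candidate has to be tried under cross-validation. GridSearchCV reports the best parameters and the corresponding score.
Core source (verbatim from the full source at the end of this article):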
def mode_grid(X: np.ndarray, y: np.ndarray) -> None:
    """演示网格搜索"""
    print(f"\n[{nexdo_time()}] 网格搜索演示")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0], "clf__penalty": ["l2"]}
    gs = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
    gs.fit(X_train, y_train)
    rows = []
    for params, mean, std in zip(
        gs.cv_results_["params"],
        gs.cv_results_["mean_test_score"],
        gs.cv_results_["std_test_score"],
    ):
        rows.append([params["clf__C"], f"{mean:.4f}", f"{std:.4f}"])
    print_table(["C值", "CV均值", "CV标准差"], rows, "GridSearchCV 结果")
    print(f"\n最优参数: {gs.best_params_}")
    print(f"测试集准确率: {accuracy_score(y_test, gs.predict(X_test)):.4f}")
Runnable demo (mock data and print feedback added):
from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y

def mode_grid(X: np.ndarray, y: np.ndarray) -> None:
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ])
    # Keys follow the pattern <step name>__<parameter name>
    params = {"model__C": [0.1, 1.0, 10.0]}
    grid = GridSearchCV(pipe, params, cv=3, scoring="accuracy")
    grid.fit(X, y)
    print("Best parameters:", grid.best_params_)
    print(f"Best CV score: {grid.best_score_:.4f}")

X, y = make_data(n_samples=180, n_features=20)
mode_grid(X, y)
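Grid keys such as clf__C or model__C follow the pattern step name, two underscores, parameter name, so the prefix must match the name given to the step in the Pipeline. A quick way to check which keys are valid, sketched here with the step names from the demo above, is to inspect get_params().

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every tunable parameter of the "model" step is exposed as "model__<name>"
tunable = sorted(name for name in pipe.get_params() if name.startswith("model__"))
print("some valid grid keys:", tunable[:5])

# set_params uses the same naming scheme that GridSearchCV relies on
pipe.set_params(model__C=0.1)
print("C after set_params:", pipe.named_steps["model"].C)
```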
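Step 7: Use main as a remote control for the preprocess/select/cv/grid/full modes
Pain point and mechanism:
main is the script's remote control: --mode preprocess/select/cv/grid/full maps to the different stages of the machine-learning pipeline. Users do not edit code; they change one argument to observe each stage. Note that the full script accepts these five modes only. The extra all mode, which runs every stage in sequence, exists only in the runnable demo in this step.
Core source (verbatim from the full source at the end of this article):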
def main() -> None:
    parser = argparse.ArgumentParser(description="Sklearn 特征工程完整演示")
    parser.add_argument("--mode", choices=["preprocess", "select", "cv", "grid", "full"],
                        default="full", help="运行模式")
    parser.add_argument("--samples", type=int, default=1000, help="样本数量")
    args = parser.parse_args()
    print(f"[{nexdo_time()}] 生成数据集 samples={args.samples}")
    X, y = make_data(args.samples)
    dispatch = {
        "preprocess": lambda: mode_preprocess(X),
        "select": lambda: mode_select(X, y),
        "cv": lambda: mode_cv(X, y),
        "grid": lambda: mode_grid(X, y),
        "full": lambda: mode_full(X, y),
    }
    dispatch[args.mode]()
Runnable demo (stub mode functions, an extra all mode, mock data and print feedback):
import argparse
import sys
from sklearn.datasets import make_classification

def make_data():
    return make_classification(n_samples=80, n_features=8, n_informative=4, random_state=42)

# Stub versions of the mode functions: they only report that they were called
def mode_preprocess(X):
    print("Running preprocess: standardization / normalization", X.shape)

def mode_select(X, y):
    print("Running select: choosing the most relevant of", X.shape[1], "columns")

def mode_cv(X, y):
    print("Running cv: Pipeline + cross_val_score")

def mode_grid(X, y):
    print("Running grid: GridSearchCV hyperparameter search")

def mode_full(X, y):
    print("Running full: train and evaluate the complete model")

def main() -> None:
    parser = argparse.ArgumentParser(description="Hands-on sklearn Pipeline demo")
    parser.add_argument("--mode", choices=["preprocess", "select", "cv", "grid", "full", "all"], default="all")
    args = parser.parse_args()
    X, y = make_data()
    print("Data ready:", X.shape)
    if args.mode in ("preprocess", "all"):
        mode_preprocess(X)
    if args.mode in ("select", "all"):
        mode_select(X, y)
    if args.mode in ("cv", "all"):
        mode_cv(X, y)
    if args.mode in ("grid", "all"):
        mode_grid(X, y)
    if args.mode in ("full", "all"):
        mode_full(X, y)

# Simulate the command line by overwriting sys.argv before each call
for mode in ["preprocess", "select", "cv", "grid", "full", "all"]:
    print(f"\n>>> python3 39-sklearn-pipeline.py --mode {mode}")
    sys.argv = ["prog", "--mode", mode]
    main()
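An alternative to overwriting sys.argv, sketched below, is to pass the argument list straight to parse_args, which argparse supports. The helper names build_parser and run are hypothetical and not part of the article's script; the mode functions are replaced by a print so that the snippet stays self-contained.

```python
import argparse

MODES = ["preprocess", "select", "cv", "grid", "full"]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="mode dispatch sketch")
    parser.add_argument("--mode", choices=MODES + ["all"], default="all")
    return parser

def run(argv: list) -> list:
    """Parse an explicit argument list and return the stages that would run."""
    args = build_parser().parse_args(argv)
    selected = MODES if args.mode == "all" else [args.mode]
    for name in selected:
        print(f"running {name}")  # a real script would call mode_<name> here
    return selected

run(["--mode", "cv"])  # one stage
run([])                # default "all": every stage in order
```

Passing the list explicitly keeps global state untouched, which tends to make the dispatcher easier to test.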
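Geek's hands-on: full source and run
Now put the building blocks together: paste the complete code below into your editor and run it. Look at the whole loop first, then go back and change parameters section by section; that makes it easier to build engineering intuition.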
#!/usr/bin/env python3
"""
39-sklearn-pipeline.py
完整演示:数据预处理 → 特征选择 → 模型训练 → 交叉验证 → 网格搜索
用法:
    python 39-sklearn-pipeline.py --mode preprocess
    python 39-sklearn-pipeline.py --mode select
    python 39-sklearn-pipeline.py --mode cv
    python 39-sklearn-pipeline.py --mode grid
    python 39-sklearn-pipeline.py --mode full
"""
import argparse
import time
from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=10, n_redundant=5, random_state=42
    )
    return X, y

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*60}")
        print(f" {title}")
        print(f"{'='*60}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    sep = "┼".join("─" * (w + 2) for w in col_widths)
    header_line = "│".join(f" {str(h):<{w}} " for h, w in zip(headers, col_widths))
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{header_line}│")
    print(f"├{sep}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

def mode_preprocess(X: np.ndarray) -> None:
    """演示标准化与归一化"""
    print(f"\n[{nexdo_time()}] 数据预处理演示")
    sample = X[:5, :4]
    std = StandardScaler().fit_transform(sample)
    mm = MinMaxScaler().fit_transform(sample)
    rows = []
    for i in range(5):
        rows.append([
            f"样本{i}",
            f"{sample[i,0]:.3f}",
            f"{std[i,0]:.3f}",
            f"{mm[i,0]:.3f}",
        ])
    print_table(["样本", "原始值(f0)", "标准化", "归一化"], rows, "预处理效果对比(特征0)")

def mode_select(X: np.ndarray, y: np.ndarray) -> None:
    """演示特征选择"""
    print(f"\n[{nexdo_time()}] 特征选择演示")
    selector = SelectKBest(f_classif, k=5)
    selector.fit(X, y)
    scores = selector.scores_
    top5_idx = np.argsort(scores)[::-1][:5]
    rows = [(f"特征{i}", f"{scores[i]:.2f}", "✓" if i in top5_idx else "") for i in range(10)]
    print_table(["特征", "F分数", "入选Top5"], rows, "SelectKBest 特征评分(前10)")
    rfe = RFE(LogisticRegression(max_iter=500, random_state=42), n_features_to_select=5)
    rfe.fit(X, y)
    rows2 = [(f"特征{i}", rfe.ranking_[i], "✓" if rfe.support_[i] else "") for i in range(10)]
    print_table(["特征", "RFE排名", "入选"], rows2, "RFE 递归特征消除(前10)")

def mode_cv(X: np.ndarray, y: np.ndarray) -> None:
    """演示交叉验证"""
    print(f"\n[{nexdo_time()}] 交叉验证演示")
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    rows = [(f"Fold {i+1}", f"{s:.4f}") for i, s in enumerate(scores)]
    rows.append(["均值", f"{scores.mean():.4f}"])
    rows.append(["标准差", f"{scores.std():.4f}"])
    print_table(["折次", "准确率"], rows, "5折交叉验证结果")

def mode_grid(X: np.ndarray, y: np.ndarray) -> None:
    """演示网格搜索"""
    print(f"\n[{nexdo_time()}] 网格搜索演示")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0], "clf__penalty": ["l2"]}
    gs = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
    gs.fit(X_train, y_train)
    rows = []
    for params, mean, std in zip(
        gs.cv_results_["params"],
        gs.cv_results_["mean_test_score"],
        gs.cv_results_["std_test_score"],
    ):
        rows.append([params["clf__C"], f"{mean:.4f}", f"{std:.4f}"])
    print_table(["C值", "CV均值", "CV标准差"], rows, "GridSearchCV 结果")
    print(f"\n最优参数: {gs.best_params_}")
    print(f"测试集准确率: {accuracy_score(y_test, gs.predict(X_test)):.4f}")

def mode_full(X: np.ndarray, y: np.ndarray) -> None:
    """完整流水线"""
    print(f"\n[{nexdo_time()}] 完整 Pipeline 演示")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(f_classif, k=10)),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(f"[{nexdo_time()}] 完整流水线完成,准确率: {accuracy_score(y_test, y_pred):.4f}")

def main() -> None:
    parser = argparse.ArgumentParser(description="Sklearn 特征工程完整演示")
    parser.add_argument("--mode", choices=["preprocess", "select", "cv", "grid", "full"],
                        default="full", help="运行模式")
    parser.add_argument("--samples", type=int, default=1000, help="样本数量")
    args = parser.parse_args()
    print(f"[{nexdo_time()}] 生成数据集 samples={args.samples}")
    X, y = make_data(args.samples)
    dispatch = {
        "preprocess": lambda: mode_preprocess(X),
        "select": lambda: mode_select(X, y),
        "cv": lambda: mode_cv(X, y),
        "grid": lambda: mode_grid(X, y),
        "full": lambda: mode_full(X, y),
    }
    dispatch[args.mode]()

if __name__ == "__main__":
    main()
$ python3 39-sklearn-pipeline.py --mode preprocess --samples 200
[2026-04-18 10:38:20] 生成数据集 samples=200
[2026-04-18 10:38:20] 数据预处理演示
============================================================
预处理效果对比(特征0)
============================================================
┌─────┬─────────┬────────┬───────┐
│ 样本 │ 原始值(f0) │ 标准化 │ 归一化 │
├─────┼─────────┼────────┼───────┤
│ 样本0 │ -1.328 │ -0.731 │ 0.246 │
│ 样本1 │ -0.729 │ -0.071 │ 0.490 │
│ 样本2 │ 0.144 │ 0.889 │ 0.845 │
│ 样本3 │ -1.934 │ -1.397 │ 0.000 │
│ 样本4 │ 0.526 │ 1.310 │ 1.000 │
└─────┴─────────┴────────┴───────┘
$ python3 39-sklearn-pipeline.py --mode cv --samples 200
[2026-04-18 10:38:21] 生成数据集 samples=200
[2026-04-18 10:38:21] 交叉验证演示
============================================================
5折交叉验证结果
============================================================
┌────────┬────────┐
│ 折次 │ 准确率 │
├────────┼────────┤
│ Fold 1 │ 0.8000 │
│ Fold 2 │ 0.7500 │
│ Fold 3 │ 0.8500 │
│ Fold 4 │ 0.8250 │
│ Fold 5 │ 0.7250 │
│ 均值 │ 0.7900 │
│ 标准差 │ 0.0464 │
└────────┴────────┘
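Summary
| Concept | One-line reminder |
|---|---|
| fit(X, y) | Learns parameters from the training set; call it only once |
| transform(X) | Transforms data with the learned parameters; can be called repeatedly |
| fit_transform | Training set only; the test set gets transform only |
| StandardScaler | Standardization: mean = 0, standard deviation = 1 per feature |
| MinMaxScaler | Normalization: maps each feature to [0, 1] |
| SelectKBest | Filter-style feature selection, scored by a statistical test |
| Pipeline | Chains estimators and prevents data leakage |
| cross_val_score | k-fold cross-validation to estimate generalization |
| GridSearchCV | Grid search for the best hyperparameters |
| Data leakage | Test-set information is used to train the model, which inflates the evaluation |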
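⏱ NexDo Time (5 minutes)
Challenge: use GridSearchCV to search for the best hyperparameters of a random forest.
Steps:
- Generate data with make_data()
- Build a Pipeline: StandardScaler + RandomForestClassifier (name the classifier step "clf" so that it matches the grid keys below)
- Define the parameter grid: {"clf__n_estimators": [50, 100], "clf__max_depth": [3, 5, None]}
- Search with GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
- Print the best parameters and the best cross-validation score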
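If you get stuck, one possible shape of a solution is sketched below. It is a starting point rather than a reference answer: the sample size of 300 and the fixed random_state values are assumptions chosen to keep the run short and repeatable, and make_classification is called inline with the same settings as make_data() so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same generator settings as make_data(), with a smaller sample size (assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# The classifier step is named "clf" so that it matches the "clf__" grid keys
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [3, 5, None]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print("best parameters:", grid.best_params_)
print(f"best CV accuracy: {grid.best_score_:.4f}")
```

Tree-based models do not need scaled inputs, so the StandardScaler step is here only because the challenge asks for it.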
Don’t wait for next time, do it in the next moment.