
39 · scikit-learn in Practice: Preprocessing, Feature Selection, and Cross-Validation

#048 · 2026-04-17 · Python


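🔗 Knowledge-graph navigation: Before reading this article, it helps to be comfortable with the data cleaning covered in "32 · Pandas in Practice: Data Cleaning, Aggregation, and Time Series" and the matrix operations covered in "35 · Linear Algebra in Practice: Eigenvalues, SVD, and Least Squares". scikit-learn is built on NumPy array operations, and its estimators typically take and return ndarrays.

Where we are: In the previous article, "38 · Audio Signal Processing: FFT Spectrum Analysis and Low-Pass Filtering", we used the Fourier transform and low-pass filtering to move sound waves between the time and frequency domains and to denoise them. Features produced by that kind of physical signal processing, or by financial time-series modeling, only deliver predictive value once they are scaled and fed to a model inside a sound machine learning workflow. This article turns to machine learning proper with scikit-learn (sklearn), the most widely used general-purpose machine learning library in Python. We break down its core Estimator API, demonstrate standardization and min-max normalization, walk through SelectKBest and RFE feature selection, and use Pipeline to guard against data leakage while training a classifier under 5-fold cross-validation and grid search.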

Environment: pip install scikit-learn numpy.

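Geek's take: scikit-learn's core design is the Estimator API. Every estimator has fit(X, y), plus either transform(X) (transformers) or predict(X) (predictors). A Pipeline chains several estimators so that, when the whole pipeline is passed to cross-validation, preprocessing is fitted only on the training portion of each split, which prevents data leakage.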

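scikit-learn Core Interface

fit(X, y)           Learn parameters from the training set
transform(X)        Transform data using the learned parameters
fit_transform(X, y) fit + transform in one call (training set only)
predict(X)          Predict labels for new data
score(X, y)         Evaluate model performance
Pipeline            Chain several estimators; prevents data leakage
cross_val_score     k-fold cross-validation to estimate generalization
GridSearchCV        Grid search for the best hyperparameters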
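The calls in the table can be seen working together in a minimal sketch. The dataset sizes, seeds, and the choice of LogisticRegression are illustrative assumptions, not part of the article's script; the point is only the order of fit, transform, predict, and score on a train/test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data; sizes and seeds are arbitrary illustration values.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std on the training set only
X_test_s = scaler.transform(X_test)        # reuse those statistics on the test set

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_s, y_train)                  # learn coefficients
y_pred = clf.predict(X_test_s)               # predict labels for unseen rows
test_accuracy = clf.score(X_test_s, y_test)  # mean accuracy for classifiers

print("learned means:", np.round(scaler.mean_, 3).tolist())
print("predictions shape:", y_pred.shape)
print(f"test accuracy: {test_accuracy:.3f}")
```

Note that the test set only ever sees transform, never fit_transform; a Pipeline automates exactly this discipline.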

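Step by Step: Breaking Down the Core Logic

This article splits the sklearn machine learning workflow into 7 steps: generating data, printing result tables, preprocessing, feature selection, cross-validation, grid search, and finally a CLI that dispatches the modes. Every demo comes with mock data and print() feedback.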

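Step 1: Generate an offline classification dataset with make_data

Pain point and mechanism:

make_data is the data factory for this article: it uses sklearn's built-in make_classification to generate a classification dataset, so nothing has to be downloaded. X is the feature table, like a student's grades across many subjects; y is the label, like whether the student finally passed. All later preprocessing, selection, and training steps operate on these two arrays.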
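To make the roles of n_informative and n_redundant concrete, here is a small sketch. It assumes shuffle=False (make_data itself keeps the default shuffle=True), because make_classification documents that, without shuffling, the columns come out as informative features first, then redundant ones, then noise. Redundant columns are linear combinations of the informative ones, which a rank check can confirm.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the generated column order: informative columns first,
# then redundant ones, then pure-noise columns (make_data shuffles them).
X, y = make_classification(
    n_samples=300, n_features=20,
    n_informative=10, n_redundant=5,
    shuffle=False, random_state=42,
)

# Redundant columns are linear combinations of the informative ones,
# so stacking them adds no new directions: the rank stays at 10.
rank_informative = np.linalg.matrix_rank(X[:, :10])
rank_with_redundant = np.linalg.matrix_rank(X[:, :15])
# The last 5 columns are independent noise, so they do add rank.
rank_all = np.linalg.matrix_rank(X)

print("rank of informative block:", rank_informative)
print("rank after adding redundant block:", rank_with_redundant)
print("rank of all 20 columns:", rank_all)
```

This is why feature selection has something to find later: of the 20 columns, only 10 carry independent signal.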

Core source (excerpted from the full script at the end of the article):

def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=10, n_redundant=5, random_state=42
    )
    return X, y

Runnable demo (with mock data and print feedback):

from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification


def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=10, n_redundant=5, random_state=42
    )
    return X, y


X, y = make_data(n_samples=120, n_features=20)
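print("📦 Classification data generated")
print("X shape:", X.shape)
print("y shape:", y.shape)
# Convert NumPy scalars to plain ints so the dict prints cleanly.
print("Class distribution:", {int(k): int(v) for k, v in zip(*np.unique(y, return_counts=True))})
print("First row (first 5 features):", np.round(X[0, :5], 3).tolist())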

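Step 2: Lay out experiment results as a terminal table with print_table

Pain point and mechanism:

print_table is a small report tool for the terminal. A machine learning workflow produces many results to compare, and without a table it is hard for a beginner to judge which step did better. The function works like a lab notebook with ruled columns: every metric gets a fixed position.

Core source (excerpted from the full script at the end of the article):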

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*60}")
        print(f"  {title}")
        print(f"{'='*60}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    sep = "┼".join("─" * (w + 2) for w in col_widths)
    header_line = "│".join(f" {str(h):<{w}} " for h, w in zip(headers, col_widths))
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{header_line}│")
    print(f"├{sep}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

Runnable demo (a simplified plain-text variant of print_table, with mock data and print feedback):

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*60}")
        print(f"  {title}")
        print(f"{'='*60}")
    col_widths = [max(len(str(h)), 12) for h in headers]
    for row in rows:
        for i, cell in enumerate(row):
            col_widths[i] = max(col_widths[i], len(str(cell)))
    fmt = "  " + "  ".join(f"{{:<{w}}}" for w in col_widths)
    print(fmt.format(*headers))
    print("  " + "  ".join("-" * w for w in col_widths))
    for row in rows:
        print(fmt.format(*row))


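print_table(
    ["Step", "Purpose", "Beginner's view"],
    [["Scaler", "Unify scales", "Measure height and weight with one ruler"], ["Model", "Train a classifier", "Teach the machine to tell classes apart"]],
    title="Pipeline components",
)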

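Step 3: Put features on a common scale with StandardScaler/MinMaxScaler

Pain point and mechanism:

Preprocessing puts features on a common scale before training. StandardScaler rescales each feature to mean 0 and standard deviation 1; MinMaxScaler maps each feature into the range 0 to 1. It is like converting centimeters, meters, and kilometers to one unit; otherwise scale-sensitive models are dominated by features that simply have large numbers. In the full script, mode_preprocess fits both scalers on a 5-row, 4-column slice purely to keep the printed table small, so its statistics describe those 5 rows, not the whole dataset.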
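Both scalers apply simple per-column formulas, which the sketch below reproduces by hand with NumPy and compares against sklearn. The 4×2 matrix is a made-up example; the only assumption worth noting is that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny hand-made matrix: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 900.0]])

# StandardScaler: z = (x - mean) / std, per column, population std (ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

# MinMaxScaler: x' = (x - min) / (max - min), per column.
mm_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(X)

print("standardized matches manual formula:", np.allclose(z_manual, z_sklearn))
print("min-max matches manual formula:", np.allclose(mm_manual, mm_sklearn))
print("min-max column 0:", np.round(mm_sklearn[:, 0], 3).tolist())
```

After either transform, the two columns live on comparable scales even though the raw values differ by two orders of magnitude.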

Core source (excerpted from the full script at the end of the article):

def mode_preprocess(X: np.ndarray) -> None:
    """Demonstrate standardization and min-max normalization."""
    print(f"\n[{nexdo_time()}] Preprocessing demo")
    sample = X[:5, :4]
    std = StandardScaler().fit_transform(sample)
    mm = MinMaxScaler().fit_transform(sample)
    rows = []
    for i in range(5):
        rows.append([
            f"sample {i}",
            f"{sample[i,0]:.3f}",
            f"{std[i,0]:.3f}",
            f"{mm[i,0]:.3f}",
        ])
    print_table(["Sample", "Raw (f0)", "Standardized", "Min-max"], rows, "Preprocessing comparison (feature 0)")

Runnable demo (with mock data and print feedback):

from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, MinMaxScaler


def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y


def mode_preprocess(X: np.ndarray) -> None:
    std = StandardScaler().fit_transform(X)
    mm = MinMaxScaler().fit_transform(X)
    rows = [
        ["Raw", f"{X[:,0].mean():.3f}", f"{X[:,0].std():.3f}", f"{X[:,0].min():.3f}~{X[:,0].max():.3f}"],
        ["Standard", f"{std[:,0].mean():.3f}", f"{std[:,0].std():.3f}", f"{std[:,0].min():.3f}~{std[:,0].max():.3f}"],
        ["MinMax", f"{mm[:,0].mean():.3f}", f"{mm[:,0].std():.3f}", f"{mm[:,0].min():.3f}~{mm[:,0].max():.3f}"],
    ]
    print("Preprocessing comparison (column 0 only; mean, std, min~max):")
    for row in rows:
        print("  ", row)


X, y = make_data(n_samples=120, n_features=20)
mode_preprocess(X)

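Step 4: Pick the more useful features with SelectKBest/RFE

Pain point and mechanism:

Feature selection asks: which columns are most useful? SelectKBest ranks columns like single-subject exam scores: it scores each feature independently with a statistical test (here the ANOVA F-test, f_classif) and keeps the top k. RFE works like a knockout tournament: it trains a model, drops the feature that contributes least, and repeats until the requested number of features remains. Both reduce noise and make the model easier to interpret.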
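As a minimal sketch of what the two selectors actually return, the example below fits both on a small synthetic dataset and inspects the selected column indices and the reduced matrices. The dataset shape, k=4, and the seed are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Illustrative data: 12 columns, of which 4 are informative and 2 redundant.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

# Filter method: score every column independently, keep the k best.
skb = SelectKBest(f_classif, k=4).fit(X, y)
skb_idx = skb.get_support(indices=True)  # indices of the kept columns
X_skb = skb.transform(X)                 # matrix reduced to those columns

# Wrapper method: refit the model repeatedly, dropping the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
rfe_idx = rfe.get_support(indices=True)
X_rfe = rfe.transform(X)

print("SelectKBest kept columns:", skb_idx.tolist(), "->", X_skb.shape)
print("RFE kept columns:", rfe_idx.tolist(), "->", X_rfe.shape)
print("RFE ranking (1 = selected):", rfe.ranking_.tolist())
```

The two methods need not agree: SelectKBest judges each column alone, while RFE judges columns by their weight inside a fitted model.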

Core source (excerpted from the full script at the end of the article):

def mode_select(X: np.ndarray, y: np.ndarray) -> None:
    """Demonstrate feature selection."""
    print(f"\n[{nexdo_time()}] Feature selection demo")
    selector = SelectKBest(f_classif, k=5)
    selector.fit(X, y)
    scores = selector.scores_
    top5_idx = np.argsort(scores)[::-1][:5]
    rows = [(f"feature {i}", f"{scores[i]:.2f}", "✓" if i in top5_idx else "") for i in range(10)]
    print_table(["Feature", "F-score", "In top 5"], rows, "SelectKBest feature scores (first 10)")

    rfe = RFE(LogisticRegression(max_iter=500, random_state=42), n_features_to_select=5)
    rfe.fit(X, y)
    rows2 = [(f"feature {i}", rfe.ranking_[i], "✓" if rfe.support_[i] else "") for i in range(10)]
    print_table(["Feature", "RFE rank", "Selected"], rows2, "RFE recursive feature elimination (first 10)")

Runnable demo (with mock data and print feedback):

from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression


def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y


def mode_select(X: np.ndarray, y: np.ndarray) -> None:
    skb = SelectKBest(f_classif, k=8).fit(X, y)
    scores = skb.scores_
    top_idx = np.argsort(scores)[-8:][::-1]
    print("SelectKBest top features:")
    for idx in top_idx[:5]:
        print(f"  feature_{idx:<2} score={scores[idx]:.2f}")

    model = LogisticRegression(max_iter=1000)
    rfe = RFE(model, n_features_to_select=8).fit(X, y)
    print("RFE selected features:", np.where(rfe.support_)[0].tolist())


X, y = make_data(n_samples=160, n_features=16)
mode_select(X, y)

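Step 5: Prevent data leakage with Pipeline + cross_val_score

Pain point and mechanism:

Pipeline combined with cross_val_score is the key defense against data leakage. The scaler must learn its mean and variance only from the training folds and must never see the validation fold in advance. A Pipeline acts like a sealed assembly line: in each fold, cross_val_score clones the pipeline, fits every step on the training part only, and then applies the fitted steps to the validation part.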
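Leakage is easiest to see on data with no signal at all. The sketch below is an illustration, not part of the article's script: it builds pure-noise features with random labels (all sizes and seeds are assumptions), then compares selecting features on the full dataset before cross-validation against doing the selection inside a Pipeline. An honest estimate should hover near chance, while the leaky one usually looks far better than it has any right to.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)      # labels unrelated to X

# Leaky: pick the "best" 20 columns using ALL rows, then cross-validate.
# The selector has already seen every validation label.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection lives inside the Pipeline and is refitted per training fold.
pipe = Pipeline([
    ("selector", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.3f}")
print(f"honest CV accuracy: {honest:.3f}")
```

The same reasoning applies to scalers, although their leakage effect is usually much smaller than that of feature selection.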

Core source (excerpted from the full script at the end of the article):

def mode_cv(X: np.ndarray, y: np.ndarray) -> None:
    """Demonstrate cross-validation."""
    print(f"\n[{nexdo_time()}] Cross-validation demo")
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    rows = [(f"Fold {i+1}", f"{s:.4f}") for i, s in enumerate(scores)]
    rows.append(["Mean", f"{scores.mean():.4f}"])
    rows.append(["Std", f"{scores.std():.4f}"])
    print_table(["Fold", "Accuracy"], rows, "5-fold cross-validation results")

Runnable demo (with mock data and print feedback):

from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y


def mode_cv(X: np.ndarray, y: np.ndarray) -> None:
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print("5-fold cross-validation accuracy:", np.round(scores, 4).tolist())
    print(f"mean={scores.mean():.4f}, std={scores.std():.4f}")


X, y = make_data(n_samples=180, n_features=20)
mode_cv(X, y)

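Step 6: Search for the best hyperparameters with GridSearchCV

Pain point and mechanism:

GridSearchCV is a fitting room for hyperparameters. For example, C in LogisticRegression is the inverse of the regularization strength (a smaller C means stronger regularization); a good value cannot be guessed, so each candidate is tried under cross-validation. GridSearchCV then reports the best parameters and their mean cross-validated score.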
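Grid keys for a Pipeline follow the pattern step name, double underscore, parameter name. The sketch below, using hypothetical step names "scaler" and "clf", lists the parameters a pipeline exposes and sets one of them the same way GridSearchCV does for each grid point.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

# Every tunable parameter is exposed as "<step name>__<parameter name>".
param_names = sorted(pipe.get_params().keys())
clf_params = [name for name in param_names if name.startswith("clf__")]

# set_params uses the same naming; GridSearchCV calls it for each grid point.
pipe.set_params(clf__C=0.1)
current_C = pipe.named_steps["clf"].C

print("some clf parameters:", clf_params[:5])
print("C after set_params:", current_C)
```

A key that does not match an existing step and parameter raises an error at fit time, so printing get_params() is a quick way to check a grid before running it.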

Core source (excerpted from the full script at the end of the article):

def mode_grid(X: np.ndarray, y: np.ndarray) -> None:
    """Demonstrate grid search."""
    print(f"\n[{nexdo_time()}] Grid search demo")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0], "clf__penalty": ["l2"]}
    gs = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
    gs.fit(X_train, y_train)
    rows = []
    for params, mean, std in zip(
        gs.cv_results_["params"],
        gs.cv_results_["mean_test_score"],
        gs.cv_results_["std_test_score"],
    ):
        rows.append([params["clf__C"], f"{mean:.4f}", f"{std:.4f}"])
    print_table(["C", "CV mean", "CV std"], rows, "GridSearchCV results")
    print(f"\nBest parameters: {gs.best_params_}")
    print(f"Test accuracy: {accuracy_score(y_test, gs.predict(X_test)):.4f}")

Runnable demo (with mock data and print feedback):

from typing import Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, n_informative=10, n_redundant=5, random_state=42)
    return X, y


def mode_grid(X: np.ndarray, y: np.ndarray) -> None:
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ])
    params = {"model__C": [0.1, 1.0, 10.0]}
    grid = GridSearchCV(pipe, params, cv=3, scoring="accuracy")
    grid.fit(X, y)
    print("Best parameters:", grid.best_params_)
    print(f"Best CV score: {grid.best_score_:.4f}")


X, y = make_data(n_samples=180, n_features=20)
mode_grid(X, y)

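Step 7: Use main as a remote control for preprocess/select/cv/grid/full

Pain point and mechanism:

main is the script's remote control: --mode selects which stage of the workflow to run. The full script accepts preprocess, select, cv, grid, and full (the default); the runnable demo below additionally accepts all, which runs every stage in turn. Users do not need to edit the code; changing one argument is enough to observe each stage.

Core source (excerpted from the full script at the end of the article):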

def main() -> None:
    parser = argparse.ArgumentParser(description="Complete sklearn feature engineering demo")
    parser.add_argument("--mode", choices=["preprocess", "select", "cv", "grid", "full"],
                        default="full", help="run mode")
    parser.add_argument("--samples", type=int, default=1000, help="number of samples")
    args = parser.parse_args()

    print(f"[{nexdo_time()}] Generating dataset samples={args.samples}")
    X, y = make_data(args.samples)

    dispatch = {
        "preprocess": lambda: mode_preprocess(X),
        "select":     lambda: mode_select(X, y),
        "cv":         lambda: mode_cv(X, y),
        "grid":       lambda: mode_grid(X, y),
        "full":       lambda: mode_full(X, y),
    }
    dispatch[args.mode]()

Runnable demo (stub mode functions, with mock data and print feedback):

import argparse
import sys
from sklearn.datasets import make_classification


def make_data():
    return make_classification(n_samples=80, n_features=8, n_informative=4, random_state=42)


def mode_preprocess(X):
    print("Running preprocess: standardization / normalization", X.shape)


def mode_select(X, y):
    print("Running select: picking the most relevant features from", X.shape[1], "columns")


def mode_cv(X, y):
    print("Running cv: Pipeline + cross_val_score")


def mode_grid(X, y):
    print("Running grid: GridSearchCV hyperparameter search")


def mode_full(X, y):
    print("Running full: train and evaluate the complete model")


def main() -> None:
    parser = argparse.ArgumentParser(description="sklearn Pipeline hands-on demo")
    parser.add_argument("--mode", choices=["preprocess", "select", "cv", "grid", "full", "all"], default="all")
    args = parser.parse_args()
    X, y = make_data()
    print("Data ready:", X.shape)
    if args.mode in ("preprocess", "all"):
        mode_preprocess(X)
    if args.mode in ("select", "all"):
        mode_select(X, y)
    if args.mode in ("cv", "all"):
        mode_cv(X, y)
    if args.mode in ("grid", "all"):
        mode_grid(X, y)
    if args.mode in ("full", "all"):
        mode_full(X, y)


for mode in ["preprocess", "select", "cv", "grid", "full", "all"]:
    print(f"\n>>> python3 39-sklearn-pipeline.py --mode {mode}")
    sys.argv = ["prog", "--mode", mode]
    main()

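Geek Practice: Full Source and How to Run It

Now assemble the building blocks: paste the complete code below into your editor and run it. Look at the whole loop first, then go back and change parameters section by section; that builds engineering intuition faster.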

#!/usr/bin/env python3
"""
39-sklearn-pipeline.py
Complete demo: preprocessing → feature selection → model training → cross-validation → grid search

Usage:
  python 39-sklearn-pipeline.py --mode preprocess
  python 39-sklearn-pipeline.py --mode select
  python 39-sklearn-pipeline.py --mode cv
  python 39-sklearn-pipeline.py --mode grid
  python 39-sklearn-pipeline.py --mode full
"""

import argparse
import time
from typing import Tuple

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def make_data(n_samples: int = 1000, n_features: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=10, n_redundant=5, random_state=42
    )
    return X, y


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*60}")
        print(f"  {title}")
        print(f"{'='*60}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    sep = "┼".join("─" * (w + 2) for w in col_widths)
    header_line = "│".join(f" {str(h):<{w}} " for h, w in zip(headers, col_widths))
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{header_line}│")
    print(f"├{sep}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


def mode_preprocess(X: np.ndarray) -> None:
    """Demonstrate standardization and min-max normalization."""
    print(f"\n[{nexdo_time()}] Preprocessing demo")
    sample = X[:5, :4]
    std = StandardScaler().fit_transform(sample)
    mm = MinMaxScaler().fit_transform(sample)
    rows = []
    for i in range(5):
        rows.append([
            f"sample {i}",
            f"{sample[i,0]:.3f}",
            f"{std[i,0]:.3f}",
            f"{mm[i,0]:.3f}",
        ])
    print_table(["Sample", "Raw (f0)", "Standardized", "Min-max"], rows, "Preprocessing comparison (feature 0)")


def mode_select(X: np.ndarray, y: np.ndarray) -> None:
    """Demonstrate feature selection."""
    print(f"\n[{nexdo_time()}] Feature selection demo")
    selector = SelectKBest(f_classif, k=5)
    selector.fit(X, y)
    scores = selector.scores_
    top5_idx = np.argsort(scores)[::-1][:5]
    rows = [(f"feature {i}", f"{scores[i]:.2f}", "✓" if i in top5_idx else "") for i in range(10)]
    print_table(["Feature", "F-score", "In top 5"], rows, "SelectKBest feature scores (first 10)")

    rfe = RFE(LogisticRegression(max_iter=500, random_state=42), n_features_to_select=5)
    rfe.fit(X, y)
    rows2 = [(f"feature {i}", rfe.ranking_[i], "✓" if rfe.support_[i] else "") for i in range(10)]
    print_table(["Feature", "RFE rank", "Selected"], rows2, "RFE recursive feature elimination (first 10)")


def mode_cv(X: np.ndarray, y: np.ndarray) -> None:
    """Demonstrate cross-validation."""
    print(f"\n[{nexdo_time()}] Cross-validation demo")
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    rows = [(f"Fold {i+1}", f"{s:.4f}") for i, s in enumerate(scores)]
    rows.append(["Mean", f"{scores.mean():.4f}"])
    rows.append(["Std", f"{scores.std():.4f}"])
    print_table(["Fold", "Accuracy"], rows, "5-fold cross-validation results")


def mode_grid(X: np.ndarray, y: np.ndarray) -> None:
    """Demonstrate grid search."""
    print(f"\n[{nexdo_time()}] Grid search demo")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=500, random_state=42)),
    ])
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0], "clf__penalty": ["l2"]}
    gs = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
    gs.fit(X_train, y_train)
    rows = []
    for params, mean, std in zip(
        gs.cv_results_["params"],
        gs.cv_results_["mean_test_score"],
        gs.cv_results_["std_test_score"],
    ):
        rows.append([params["clf__C"], f"{mean:.4f}", f"{std:.4f}"])
    print_table(["C", "CV mean", "CV std"], rows, "GridSearchCV results")
    print(f"\nBest parameters: {gs.best_params_}")
    print(f"Test accuracy: {accuracy_score(y_test, gs.predict(X_test)):.4f}")


def mode_full(X: np.ndarray, y: np.ndarray) -> None:
    """Run the full pipeline."""
    print(f"\n[{nexdo_time()}] Full Pipeline demo")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(f_classif, k=10)),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(f"[{nexdo_time()}] Full pipeline finished, accuracy: {accuracy_score(y_test, y_pred):.4f}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Complete sklearn feature engineering demo")
    parser.add_argument("--mode", choices=["preprocess", "select", "cv", "grid", "full"],
                        default="full", help="run mode")
    parser.add_argument("--samples", type=int, default=1000, help="number of samples")
    args = parser.parse_args()

    print(f"[{nexdo_time()}] Generating dataset samples={args.samples}")
    X, y = make_data(args.samples)

    dispatch = {
        "preprocess": lambda: mode_preprocess(X),
        "select":     lambda: mode_select(X, y),
        "cv":         lambda: mode_cv(X, y),
        "grid":       lambda: mode_grid(X, y),
        "full":       lambda: mode_full(X, y),
    }
    dispatch[args.mode]()


if __name__ == "__main__":
    main()

$ python3 39-sklearn-pipeline.py --mode preprocess --samples 200

[2026-04-18 10:38:20] Generating dataset samples=200

[2026-04-18 10:38:20] Preprocessing demo

============================================================
  Preprocessing comparison (feature 0)
============================================================
┌──────────┬──────────┬──────────────┬─────────┐
│ Sample   │ Raw (f0) │ Standardized │ Min-max │
├──────────┼──────────┼──────────────┼─────────┤
│ sample 0 │ -1.328   │ -0.731       │ 0.246   │
│ sample 1 │ -0.729   │ -0.071       │ 0.490   │
│ sample 2 │ 0.144    │ 0.889        │ 0.845   │
│ sample 3 │ -1.934   │ -1.397       │ 0.000   │
│ sample 4 │ 0.526    │ 1.310        │ 1.000   │
└──────────┴──────────┴──────────────┴─────────┘

$ python3 39-sklearn-pipeline.py --mode cv --samples 200

[2026-04-18 10:38:21] Generating dataset samples=200

[2026-04-18 10:38:21] Cross-validation demo

============================================================
  5-fold cross-validation results
============================================================
┌────────┬──────────┐
│ Fold   │ Accuracy │
├────────┼──────────┤
│ Fold 1 │ 0.8000   │
│ Fold 2 │ 0.7500   │
│ Fold 3 │ 0.8500   │
│ Fold 4 │ 0.8250   │
│ Fold 5 │ 0.7250   │
│ Mean   │ 0.7900   │
│ Std    │ 0.0464   │
└────────┴──────────┘

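Summary

Concept	One-line reminder
`fit(X, y)`	Learns parameters from the training set; a new call refits and overwrites what was learned
`transform(X)`	Applies the learned parameters to data; can be called any number of times
`fit_transform`	For the training set only; the test set gets `transform` alone
`StandardScaler`	Standardization: mean = 0, standard deviation = 1
`MinMaxScaler`	Normalization: maps values to [0, 1]
`SelectKBest`	Filter-style feature selection scored by a statistical test
`Pipeline`	Chains estimators and prevents data leakage
`cross_val_score`	k-fold cross-validation to estimate generalization
`GridSearchCV`	Grid search for the best hyperparameters
Data leakage	Test-set information leaks into training, inflating the evaluation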

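⏱ NexDo Time (5 minutes)

Challenge: use GridSearchCV to search for the best random forest hyperparameters.

Steps:

1. Generate data with make_data()
2. Build a Pipeline: StandardScaler + RandomForestClassifier, naming the classifier step "clf" so that the grid keys below resolve
3. Define the parameter grid: {"clf__n_estimators": [50, 100], "clf__max_depth": [3, 5, None]}
4. Search with GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
5. Print the best parameters and the best cross-validation score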

Don’t wait for next time, do it in the next moment.

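💡 Next up: With preprocessing, feature selection, cross-validation, and Pipeline-based leakage protection in place, you have a solid engineering foundation for machine learning. Next we move on to specific algorithms, starting with the most classic and most interpretable regression models. In "40 · Linear and Logistic Regression: From Prediction to Binary Classification", we look at the principles behind continuous-value prediction (linear regression) and binary classification (logistic regression), and use sklearn for model fitting, residual analysis, and decision-boundary plotting, opening the series on supervised learning algorithms.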