
41 · Classification Trees and Random Forests: Building a Decision System

#050 · 2026-04-17 · Python

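🔗 Knowledge graph navigation: before reading this article, review the train/test split, classification metrics, and overfitting concepts from "40 · Linear and Logistic Regression: From Prediction to Binary Classification". This article carries classification forward to tree models, which are easier to picture.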

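Runtime environment: pip install numpy scikit-learn. All data in this article is generated locally by sklearn, so there are no files to download and no dependency on external services.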

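Pain point and architecture: logistic regression decides with a single straight boundary, but many business rules cannot be expressed by one straight line. A decision tree is like a flowchart that asks one condition after another, and a random forest lets many such trees vote. The goal here is for you to run tree models end to end, read feature importance, and understand why a tree that is too deep overfits.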

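Build intuition for tree models first

Decision tree: one person asks questions by rule; easy to explain, but prone to memorizing details of the training set.
Random forest: many trees vote; a single tree may be biased, but the collective result is more stable.
Gradient boosting: each new tree corrects the errors of the trees before it; accuracy is often high, but it needs more tuning.

Geek's take: a decision tree is like a customer-service routing script: "if age is under 30, look at income; if income is high, look at past behavior". A random forest is like an expert committee in which each expert sees slightly different data (a bootstrap sample of the rows, plus a random subset of features at each split), and a final vote decides the class.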
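
To make the "routing script" picture concrete, here is a minimal sketch that fits a very shallow tree and prints its learned rules with scikit-learn's export_text. The dataset sizes and the feature names f0..f3 are arbitrary illustration values, not part of the article's script.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic dataset; the sizes are illustration values only.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)

# A depth-2 tree asks at most two questions before it decides.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# export_text renders the fitted tree as nested "feature <= threshold" rules.
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Each indented line in the output is one question, and each "class:" line is a final decision, which is why a shallow tree is easy to explain.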

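Step by step: breaking down the core logic

This article is split into 6 steps: first generate classification data, then learn to print tables, then look at model comparison, feature importance, and overfitting from tree depth in turn, and finally use argparse to tie everything into a runnable script.

Step 1: Use make_data to build a reproducible classification dataset

Pain point and mechanism

make_data is this article's local data factory. It calls make_classification() to generate 15 features: 8 carry real signal, 4 are redundant (linear combinations of the informative ones), and the remaining 3 are pure noise. Think of it as an exam: some questions really separate skill levels, and some just re-ask the same thing in other words. Fixing random_state=42 makes every run produce the same data, so beginners can compare their results against the article.

Core source (verbatim from the full source at the end)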

def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names

Runnable demo (with mock data and print feedback filled in)

from typing import List, Tuple
import numpy as np
from sklearn.datasets import make_classification

def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names

X, y, feature_names = make_data()
print("数据形状 X:", X.shape)
print("标签形状 y:", y.shape)
print("前5个特征名:", feature_names[:5])
print("类别分布:", dict(zip(*np.unique(y, return_counts=True))))
print("第一行前5个特征:", np.round(X[0, :5], 3).tolist())
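
The reproducibility and redundancy claims can be checked directly. The sketch below reuses the generator settings from make_data(); the helper name make_xy and the second seed 7 are hypothetical and exist only for this comparison.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_xy(seed: int):
    # Same generator settings as make_data(), with the seed exposed (hypothetical helper).
    return make_classification(n_samples=2000, n_features=15, n_informative=8,
                               n_redundant=4, random_state=seed)


X_a, y_a = make_xy(42)
X_b, y_b = make_xy(42)
X_c, y_c = make_xy(7)

# Same seed: identical arrays. Different seed: different data.
same_seed_equal = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
diff_seed_equal = np.array_equal(X_a, X_c)

# Features that are neither informative nor redundant are noise columns.
n_noise = 15 - 8 - 4

# Redundant columns are linear combinations of informative ones,
# so the matrix rank should come out below the number of columns.
rank = np.linalg.matrix_rank(X_a)

print("same seed -> identical data:", same_seed_equal)
print("different seed -> identical data:", diff_seed_equal)
print("noise features:", n_noise)
print("columns:", X_a.shape[1], "rank:", rank)
```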

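Step 2: Use print_table to turn model results into a terminal report card

Pain point and mechanism

print_table lines up model names, accuracies, standard deviations, and similar values in columns. The worst thing for a machine learning experiment is results scattered across the screen like a running log; a table works like a report card, so you can see at a glance which model is stable, which is overfitting, and which deserves more tuning.

Core source (verbatim from the full source at the end)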

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

Runnable demo (with mock data and print feedback filled in)

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

print_table(
    ["模型", "像什么", "优点"],
    [
        ["决策树", "一张规则问答表", "解释清楚"],
        ["随机森林", "一群树投票", "更稳"],
        ["梯度提升", "错题本迭代", "精度常更高"],
    ],
    "树模型家族速记",
)
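
One caveat: print_table sizes its columns with len(), which counts a Chinese character as one column, while many terminals draw it two columns wide, so borders can drift out of line. The sketch below shows one possible way to measure display width with the standard library; display_width and pad are hypothetical helpers, not part of the article's script, and they ignore edge cases such as combining characters.

```python
import unicodedata


def display_width(text: str) -> int:
    """Terminal columns used by text, counting wide/fullwidth characters as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def pad(text: str, width: int) -> str:
    """Left-align text to `width` display columns."""
    return text + " " * max(0, width - display_width(text))


label = "决策树"
print("len:", len(label), "display width:", display_width(label))
print(repr(pad(label, 8)))
print(repr(pad("tree", 8)))
```

Swapping len(str(...)) for display_width(str(...)) and the `:<{w}` format for pad(...) inside print_table would be the natural next step if alignment matters in your terminal.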

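Step 3: Use mode_compare to compare a single tree, a random forest, and gradient boosting

Pain point and mechanism

mode_compare measures generalization with 5-fold cross-validation. A single tree is like one expert making the call, a random forest is like a group of experts voting, and gradient boosting is like repeatedly correcting a list of past mistakes. Cross-validation is like splitting the exam into 5 parts and using each part as the test in turn, so the result does not hinge on one lucky split.

Core source (verbatim from the full source at the end)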

def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    models = [
        ("决策树(depth=3)",  DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)",   DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)",  RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比(5折交叉验证)")

Runnable demo (with mock data and print feedback filled in)

import time
from typing import List, Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


X, y = make_classification(n_samples=400, n_features=15, n_informative=8, n_redundant=4, random_state=42)

def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    models = [
        ("决策树(depth=3)",  DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)",   DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)",  RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比(5折交叉验证)")

mode_compare(X, y)
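
To see what cross_val_score(..., cv=5) does under the hood, the sketch below runs the five folds by hand and compares the scores. It assumes scikit-learn's documented behavior that an integer cv with a classifier uses StratifiedKFold without shuffling; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Manual loop: train on 4 folds, score on the held-out fold, 5 times.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_clf = clone(clf)  # fresh, unfitted copy for every fold
    fold_clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_clf.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print("manual :", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
print("mean / std:", round(manual_scores.mean(), 4), round(manual_scores.std(), 4))
```

The mean is the "CV mean accuracy" column and the standard deviation is the "CV std" column that mode_compare prints.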

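Step 4: Use mode_importance to find the features the model relies on most

Pain point and mechanism

feature_importances_ tells you which features contributed most to reducing impurity when the model split its nodes; the values are normalized to sum to 1. It is like a weight table of clues in a case: a random forest combines the opinions of many trees, so its ranking is usually more stable than a single tree's. The ASCII bar chart lets beginners read the ranking without plotting anything.

Core source (verbatim from the full source at the end)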

def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "决策树":   DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_

    # 按随机森林重要性排序
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]

    rows = []
    for rank, idx in enumerate(order[:10]):
        bar_rf = "█" * int(rf_imp[idx] * 100)
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名(Top10)")

    # ASCII 条形图(随机森林)
    print("\n  随机森林特征重要性(Top10 ASCII条形图)")
    print(f"  {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f"  {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f"  {'─'*55}")

Runnable demo (with mock data and print feedback filled in)

import time
from typing import List, Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

X, y = make_classification(n_samples=500, n_features=15, n_informative=8, n_redundant=4, random_state=42)
feature_names = [f"feat_{i:02d}" for i in range(15)]

def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "决策树":   DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_

    # 按随机森林重要性排序
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]

    rows = []
    for rank, idx in enumerate(order[:10]):
        bar_rf = "█" * int(rf_imp[idx] * 100)
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名(Top10)")

    # ASCII 条形图(随机森林)
    print("\n  随机森林特征重要性(Top10 ASCII条形图)")
    print(f"  {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f"  {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f"  {'─'*55}")

mode_importance(X, y, feature_names)
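
The idea that a forest "combines the opinions of many trees" can be inspected through the fitted estimators_ list. The sketch below averages the per-tree importances and compares the result with the forest's own feature_importances_; treat it as a sanity check under the assumption that every tree has at least one split, not as a specification of the library internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# One row of importances per tree in the forest.
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
mean_imp = per_tree.mean(axis=0)   # the committee's average opinion
spread = per_tree.std(axis=0)      # how much individual trees disagree

top = int(np.argmax(mean_imp))
print("forest importances sum:", round(float(rf.feature_importances_.sum()), 6))
print("matches per-tree mean:", np.allclose(rf.feature_importances_, mean_imp))
print(f"top feature index {top}: mean={mean_imp[top]:.4f}, std across trees={spread[top]:.4f}")
```

The standard deviation across trees is a rough indicator of how far a single tree's ranking can wander from the forest's.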

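Step 5: Use mode_depth to spot overfitting in tree models

Pain point and mechanism

The deeper the tree, the finer its rules and the higher the training score tends to climb, but the test score does not necessarily follow. It is like memorizing answers: full marks on the practice questions, exposed as soon as the paper changes. mode_depth prints training accuracy, test accuracy, and their gap, and flags overfitting when the gap exceeds 0.05.

Core source (verbatim from the full source at the end)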

def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """树深度 vs 过拟合"""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")

Runnable demo (with mock data and print feedback filled in)

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

X, y = make_classification(n_samples=500, n_features=15, n_informative=8, n_redundant=4, random_state=42)

def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """树深度 vs 过拟合"""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")

mode_depth(X, y)
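
To see why an unrestricted tree can "memorize" the training set, it helps to look at how large the fitted tree actually gets. The sketch below reports the real depth and leaf count for a few max_depth settings using get_depth() and get_n_leaves(); the depth values chosen are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

stats = []
for depth in [2, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    # (requested depth, actual depth, number of leaves, train accuracy, test accuracy)
    stats.append((depth, clf.get_depth(), clf.get_n_leaves(),
                  clf.score(X_train, y_train), clf.score(X_test, y_test)))

for depth, real_depth, leaves, train_acc, test_acc in stats:
    print(f"max_depth={depth}: depth={real_depth}, leaves={leaves}, "
          f"train={train_acc:.4f}, test={test_acc:.4f}")
```

With no depth limit the tree keeps splitting until every training leaf is pure, so the leaf count grows toward the number of training samples while the test score stops improving.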

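Step 6: Use main to dispatch compare/importance/depth/all from the command line

Pain point and mechanism

main is the script's remote control. Beginners do not need to edit the code: changing the --mode argument runs the model comparison, feature importance, or depth-overfitting analysis separately, and all (the default) runs all three in one go.

Core source (verbatim from the full source at the end)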

def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")

Runnable demo (with mock data and print feedback filled in)

import argparse
import sys
import numpy as np
from typing import List, Tuple


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    X = np.zeros((3, 2))
    y = np.array([0, 1, 1])
    return X, y, ["feat_00", "feat_01"]


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print("compare:对比决策树、随机森林、梯度提升的交叉验证分数")


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print("importance:输出每个特征对分类的贡献")


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    print("depth:观察树越深是否越容易过拟合")


def nexdo_time() -> str:
    return "2026-04-18 10:50:00"

def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")

for mode in ["compare", "importance", "depth", "all"]:
    print(f"\n$ python3 41-tree-ensemble.py --mode {mode}")
    sys.argv = ["prog", "--mode", mode]
    main()
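
The demo above simulates the command line by overwriting sys.argv. An alternative sketch, assuming you are free to restructure main, is to pass the argument list to parse_args() explicitly, which keeps the dispatch logic testable without touching global state. build_parser and selected_modes are hypothetical names, not functions from the article's script.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Same --mode choices and default as the article's main().
    parser = argparse.ArgumentParser(description="tree ensemble demo")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    return parser


def selected_modes(argv: Optional[List[str]] = None) -> List[str]:
    """Return the analyses to run; argv=None falls back to the real command line."""
    args = build_parser().parse_args(argv)
    if args.mode == "all":
        return ["compare", "importance", "depth"]
    return [args.mode]


print(selected_modes([]))                   # no flag given: default mode
print(selected_modes(["--mode", "depth"]))  # a single analysis
```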

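Geek in practice: full source and run

Now put the building blocks together and paste the complete code below into your editor. Start with --mode compare, then run --mode importance and --mode depth to look at model stability, feature contributions, and overfitting respectively.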

#!/usr/bin/env python3
"""
41-tree-ensemble.py
对比决策树/随机森林/梯度提升,ASCII特征重要性排名

用法:
  python 41-tree-ensemble.py --mode compare
  python 41-tree-ensemble.py --mode importance
  python 41-tree-ensemble.py --mode depth
"""

import argparse
import time
from typing import List, Tuple

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    models = [
        ("决策树(depth=3)",  DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)",   DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)",  RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比(5折交叉验证)")


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "决策树":   DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_

    # 按随机森林重要性排序
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]

    rows = []
    for rank, idx in enumerate(order[:10]):
        bar_rf = "█" * int(rf_imp[idx] * 100)
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名(Top10)")

    # ASCII 条形图(随机森林)
    print("\n  随机森林特征重要性(Top10 ASCII条形图)")
    print(f"  {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f"  {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f"  {'─'*55}")


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """树深度 vs 过拟合"""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")


def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")


if __name__ == "__main__":
    main()

$ python3 41-tree-ensemble.py --mode compare
[2026-04-18 10:52:53] 三模型对比

=================================================================
  模型对比(5折交叉验证)
=================================================================
┌───────────────┬─────────┬────────┐
│ 模型            │ CV均值准确率 │ CV标准差  │
├───────────────┼─────────┼────────┤
│ 决策树(depth=3)  │ 0.7155  │ 0.0178 │
│ 决策树(depth=10) │ 0.7855  │ 0.0158 │
│ 决策树(无限制)      │ 0.7850  │ 0.0157 │
│ 随机森林(100棵)    │ 0.8700  │ 0.0185 │
│ 梯度提升(100轮)    │ 0.8360  │ 0.0178 │
└───────────────┴─────────┴────────┘

[2026-04-18 10:52:56] 完成

$ python3 41-tree-ensemble.py --mode depth
[2026-04-18 10:52:57] 树深度 vs 过拟合分析

=================================================================
  深度 vs 过拟合
=================================================================
┌──────┬────────┬────────┬─────────┬───────┐
│ 最大深度 │ 训练准确率  │ 测试准确率  │ 差值      │ 状态    │
├──────┼────────┼────────┼─────────┼───────┤
│ 1    │ 0.6212 │ 0.6275 │ -0.0062 │ ✓     │
│ 2    │ 0.6981 │ 0.7100 │ -0.0119 │ ✓     │
│ 3    │ 0.7444 │ 0.7400 │ 0.0044  │ ✓     │
│ 5    │ 0.7925 │ 0.7525 │ 0.0400  │ ✓     │
│ 8    │ 0.8869 │ 0.7725 │ 0.1144  │ ⚠ 过拟合 │
│ 10   │ 0.9319 │ 0.7675 │ 0.1644  │ ⚠ 过拟合 │
│ 15   │ 0.9806 │ 0.7700 │ 0.2106  │ ⚠ 过拟合 │
│ 无限制  │ 1.0000 │ 0.7650 │ 0.2350  │ ⚠ 过拟合 │
└──────┴────────┴────────┴─────────┴───────┘

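Summary

Module | What to remember
make_data | Generates stable, reproducible classification data with no external file dependency
mode_compare | Uses cross-validation to compare how well the tree, the forest, and boosted trees generalize
mode_importance | Uses feature importance to explain which input columns the model mainly relies on
mode_depth | Uses the gap between training and test accuracy to spot overfitting
main | Uses --mode to split the experiment into commands that can be run individually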

⏱ NexDo Time (5 minutes)

Challenge: change the depth list in mode_depth() to [2, 4, 6, 8, 12, None], rerun --mode depth, and see which depth gives the highest test accuracy.

Don’t wait for next time, do it in the next moment.