41 · Classification Trees and Random Forests: Building a Decision System

#050 · 2026-04-17 · Python

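🔗 Knowledge-graph navigation: before reading this article, review the training/test split, classification metrics, and overfitting concepts from "40 · Linear and Logistic Regression: From Prediction to Binary Classification". This article carries classification over to tree models, which are easier to read.

Bridge: the previous article, "40 · Linear and Logistic Regression: From Prediction to Binary Classification", covered parametric linear models for predicting continuous values and for drawing a binary decision boundary. Real business rules are often too tangled for a single linear mapping. This article moves past the linear boundary to nonparametric tree models: classification trees and random forests. It covers how a decision tree splits the feature space branch by branch, how single trees grow into ensembles (random forests and gradient boosting), how to measure in code the way tree depth drives overfitting, and how to print a feature importance bar chart in the terminal, so that you end up with a clear, interpretable decision system.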

Environment: pip install numpy scikit-learn. All data in this article is generated locally by sklearn; there are no files to download and no external services involved.

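Pain point and architecture: logistic regression judges with a single straight line, but many business rules cannot be expressed as one line. A decision tree is a flowchart that asks one condition after another, and a random forest lets many such trees vote. The goal here is for you to run tree models, read feature importances, and understand why a tree that is too deep overfits.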

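Build intuition for tree models first

Decision tree: one person asking questions by rule. Easy to explain, but prone to memorizing details of the training set.
Random forest: many trees vote. A single tree may be biased, but the collective result is more stable.
Gradient boosting: each new tree corrects the mistakes of the ones before it. Accuracy is usually high, but it needs more tuning.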
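To make the "asking questions by rule" picture concrete, here is a minimal sketch that fits a very shallow tree and prints its rules as nested conditions. It assumes scikit-learn's export_text helper; the dataset sizes (300 samples, 4 features) are illustrative choices, not the settings used later in this article.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic dataset; the sizes are illustrative, not the article's settings.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
feature_names = [f"feat_{i:02d}" for i in range(4)]

# A depth-2 tree asks at most two questions before it answers.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# export_text renders the fitted tree as indented "feature <= threshold" rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is one question, and each "class:" line is an answer. Raising max_depth adds more questions, which is where the tendency to memorize training details comes from.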

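Geek analysis: a decision tree resembles a customer-service routing script: "if age is under 30, look at income; if income is high, look at past behavior". A random forest resembles a committee of experts, each of whom sees slightly different data, with the final class decided by vote.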
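The committee picture can be checked directly. The sketch below rebuilds a forest's prediction from its individual trees. One detail worth knowing: scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting one hard vote per tree. The dataset size and the 25 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Each fitted tree is exposed in estimators_; average their class probabilities.
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])
avg_proba = per_tree.mean(axis=0)

# The class with the highest averaged probability wins.
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print("matches forest.predict_proba:", np.allclose(avg_proba, forest.predict_proba(X)))
print("matches forest.predict:", np.array_equal(manual_pred, forest.predict(X)))
```

Because each tree is trained on a different bootstrap sample, individual trees disagree on some rows; averaging is what smooths those disagreements out.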

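Step by step: breaking down the core logic

This article has 6 steps: generate classification data, learn to print a table, then look at model comparison, feature importance, and depth-driven overfitting in turn, and finally tie everything into a runnable script with argparse.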

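Step 1: Generate a reproducible classification training set with make_data

Pain point and mechanism:

make_data is this article's local data factory. It calls make_classification() to generate 15 features, of which 8 are informative and 4 are redundant (linear combinations of the informative ones); the remaining 3 are pure noise. Think of an exam: some questions really separate strong students from weak ones, and some only rephrase an earlier question. Fixing random_state=42 makes every run produce the same data, so beginners can compare results.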
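To see the informative/redundant/noise split for yourself, the sketch below generates the same kind of data with shuffle=False. That flag is an assumption added here for inspection only: make_data uses the default shuffle=True, which mixes the column order, so in the article's data you cannot tell the groups apart by position.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps columns in generation order: informative, redundant, then noise.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=8,
                           n_redundant=4, shuffle=False, random_state=42)

informative = X[:, :8]
redundant = X[:, 8:12]
noise = X[:, 12:]

# Redundant columns should be reproducible from the informative ones by least squares.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_residual = np.abs(informative @ coef - redundant).max()

print("noise columns:", noise.shape[1])
print("max residual when rebuilding redundant columns:", max_residual)
```

A residual close to zero confirms that the redundant columns carry no information beyond what the informative ones already hold, which is the "same question, rephrased" situation from the exam analogy.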

Core source (verbatim from the full source at the end):

def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names

Runnable demo (with mock data and print feedback filled in):

from typing import List, Tuple
import numpy as np
from sklearn.datasets import make_classification

def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names

X, y, feature_names = make_data()
print("数据形状 X:", X.shape)
print("标签形状 y:", y.shape)
print("前5个特征名:", feature_names[:5])
print("类别分布:", dict(zip(*np.unique(y, return_counts=True))))
print("第一行前5个特征:", np.round(X[0, :5], 3).tolist())

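Step 2: Arrange model results into a terminal report card with print_table

Pain point and mechanism:

print_table lines up model names, accuracy, standard deviation, and similar fields. Experiment results scattered across the screen read like a running log; a table works like a report card that shows at a glance which model is stable, which one overfits, and which one deserves more tuning.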
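One caveat when the cells contain Chinese text: print_table sizes columns with len(), which counts characters, while most terminals draw a CJK character two columns wide, so borders can drift. The sketch below shows one possible way to measure display width with the standard library's unicodedata module. display_width and pad are hypothetical helpers written for this illustration, not part of the article's script.

```python
import unicodedata

def display_width(text: str) -> int:
    """Count terminal columns, treating East Asian wide/fullwidth characters as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

def pad(text: str, width: int) -> str:
    """Left-align text to a display width instead of a character count."""
    return text + " " * max(0, width - display_width(text))

# Compare character count with display width for a few table cells.
for cell in ["模型", "CV", "随机森林(100棵)"]:
    print(f"│ {pad(cell, 16)} │ len={len(cell)} width={display_width(cell)}")
```

Swapping len() for display_width() and the `:<{w}` format for pad() inside print_table would be one way to keep the borders straight; with purely ASCII cells the two measures agree and nothing changes.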

Core source (verbatim from the full source at the end):

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

Runnable demo (with mock data and print feedback filled in):

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

print_table(
    ["模型", "像什么", "优点"],
    [
        ["决策树", "一张规则问答表", "解释清楚"],
        ["随机森林", "一群树投票", "更稳"],
        ["梯度提升", "错题本迭代", "精度常更高"],
    ],
    "树模型家族速记",
)

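Step 3: Compare a single tree, a random forest, and gradient boosting with mode_compare

Pain point and mechanism:

mode_compare measures generalization with 5-fold cross-validation. A single tree is one expert making the call, a random forest is a group of experts voting, and gradient boosting keeps correcting earlier mistakes. Cross-validation splits the exam paper into 5 parts and uses each part in turn as the test, so the score does not hinge on one lucky split.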
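The "5 parts taking turns" idea can be written out by hand. The sketch below loops over the folds manually and compares the result with cross_val_score. It relies on the documented behaviour that an integer cv with a classifier uses stratified, unshuffled folds; the 400-sample dataset and the depth-3 tree are illustrative choices.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Train a fresh copy on 4 folds, then score it on the held-out fold.
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(accuracy_score(y[test_idx], fold_model.predict(X[test_idx])))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("manual :", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
```

The mean of the five scores is the "CV mean accuracy" column in the comparison table, and their standard deviation is the stability column.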

Core source (verbatim from the full source at the end):

def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    models = [
        ("决策树(depth=3)",  DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)",   DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)",  RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比（5折交叉验证）")

Runnable demo (with mock data and print feedback filled in):

import time
from typing import List, Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


X, y = make_classification(n_samples=400, n_features=15, n_informative=8, n_redundant=4, random_state=42)

def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    models = [
        ("决策树(depth=3)",  DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)",   DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)",  RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比（5折交叉验证）")

mode_compare(X, y)
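The comparison treats gradient boosting as a single number, but the "keeps correcting earlier mistakes" behaviour is visible round by round. A minimal sketch, assuming a plain 80/20 split on a 400-sample dataset, that uses staged_predict to score the ensemble after each boosting round:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after each boosting round.
train_curve = [accuracy_score(y_train, pred) for pred in gb.staged_predict(X_train)]
test_curve = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]

for rounds in (1, 10, 50, 100):
    print(f"rounds={rounds:3d}  train={train_curve[rounds - 1]:.4f}  "
          f"test={test_curve[rounds - 1]:.4f}")
```

Watching where the test curve flattens while the training curve keeps climbing gives a first hint about how many rounds are worth paying for.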

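Step 4: Find the features the model relies on most with mode_importance

Pain point and mechanism:

feature_importances_ tells you which features contributed the largest total impurity reduction across the model's splits. It works like a weight table for clues in a case: a random forest pools the opinions of many trees, so its ranking is usually more stable than a single tree's. The ASCII bar chart lets beginners read the ranking without plotting.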
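To see how the forest "pools opinions", the sketch below averages the per-tree importances by hand and compares the result with the forest's own attribute. This reflects how scikit-learn computes impurity-based importances as far as I understand it; the 500 samples and 50 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# One row of importances per tree, one column per feature.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_imp = per_tree.mean(axis=0)
mean_imp = mean_imp / mean_imp.sum()          # renormalise so the scores sum to 1

print("sum of forest importances:", round(float(forest.feature_importances_.sum()), 6))
print("matches per-tree average:", np.allclose(mean_imp, forest.feature_importances_))
print("std across trees (top feature):", per_tree[:, mean_imp.argmax()].std().round(4))
```

The spread across trees in the last line is the reason a single tree's ranking can look erratic. Keep in mind that these scores are computed on the training data, so they describe what the model used, not necessarily what generalizes.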

Core source (verbatim from the full source at the end):

def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "决策树":   DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_

    # 按随机森林重要性排序
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]

    rows = []
    for rank, idx in enumerate(order[:10]):
        bar_rf = "█" * int(rf_imp[idx] * 100)
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名（Top10）")

    # ASCII 条形图（随机森林）
    print("\n  随机森林特征重要性（Top10 ASCII条形图）")
    print(f"  {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f"  {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f"  {'─'*55}")

Runnable demo (with mock data and print feedback filled in):

import time
from typing import List, Tuple
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

X, y = make_classification(n_samples=500, n_features=15, n_informative=8, n_redundant=4, random_state=42)
feature_names = [f"feat_{i:02d}" for i in range(15)]

def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "决策树":   DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_

    # 按随机森林重要性排序
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]

    rows = []
    for rank, idx in enumerate(order[:10]):
        bar_rf = "█" * int(rf_imp[idx] * 100)
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名（Top10）")

    # ASCII 条形图（随机森林）
    print("\n  随机森林特征重要性（Top10 ASCII条形图）")
    print(f"  {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f"  {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f"  {'─'*55}")

mode_importance(X, y, feature_names)
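A note on the bar chart: mode_importance draws int(rf_imp[idx] * 300) blocks and pads the bar to 30 columns with `{bar:<30}`. Padding does not truncate, so any importance above 0.10 would make the bar longer than 30 and push the number column to the right. One possible alternative, sketched below, scales every bar against the largest value. scaled_bars is a hypothetical helper and the importance values are made up purely to show the layout.

```python
from typing import List, Sequence

def scaled_bars(values: Sequence[float], width: int = 30) -> List[str]:
    """Return one bar per value, scaled so the largest value fills `width` columns."""
    top = max(values)
    if top <= 0:
        return [" " * width for _ in values]
    return [("█" * round(v / top * width)).ljust(width) for v in values]

# Hypothetical importance scores, used only to show the layout.
demo = {"feat_03": 0.14, "feat_07": 0.09, "feat_11": 0.02}
for name, bar in zip(demo, scaled_bars(list(demo.values()))):
    print(f"  {name:10s} │{bar} {demo[name]:.4f}")
```

The trade-off is that bar lengths become relative to the top feature rather than proportional to the raw score, so bars from two different runs are no longer directly comparable.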

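Step 5: Spot overfitting in tree models with mode_depth

Pain point and mechanism:

The deeper the tree, the finer the rules, and the higher the training score tends to be, but the test score does not necessarily improve. It is like memorizing answers: full marks on the practice questions, exposed by a new paper. mode_depth prints training accuracy, test accuracy, and their gap, and flags overfitting when the gap exceeds 0.05.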
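What "memorizing answers" looks like inside the model is a tree with many leaves relative to the number of training rows. The sketch below, on an illustrative 500-sample dataset, reports the actual depth and leaf count with get_depth() and get_n_leaves() for three depth limits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

summary = []
for depth in [2, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    # Record the limit, the depth actually reached, the leaf count, and training accuracy.
    summary.append((depth, clf.get_depth(), clf.get_n_leaves(), clf.score(X_train, y_train)))

print(f"training samples: {len(X_train)}")
for depth, actual_depth, leaves, train_acc in summary:
    label = "unlimited" if depth is None else str(depth)
    print(f"max_depth={label:9s} actual_depth={actual_depth:2d} "
          f"leaves={leaves:3d} train_acc={train_acc:.4f}")
```

A depth-2 tree can have at most 4 leaves, so it is forced to generalize. The unlimited tree keeps splitting until every leaf is pure, which is why its training accuracy reaches 1.0 regardless of how it does on the test set.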

Core source (verbatim from the full source at the end):

def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """树深度 vs 过拟合"""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")

Runnable demo (with mock data and print feedback filled in):

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

X, y = make_classification(n_samples=500, n_features=15, n_informative=8, n_redundant=4, random_state=42)

def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """树深度 vs 过拟合"""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")

mode_depth(X, y)

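Step 6: Dispatch compare/importance/depth/all from the command line with main

Pain point and mechanism:

main is the script's remote control. Without editing code, you switch the --mode argument to run the model comparison, feature importance, or depth analysis separately, or use all (the default) to run everything in one go.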
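If you want to exercise the dispatch logic without touching sys.argv, one option is to pass the argument list to parse_args explicitly. The sketch below is a variation on the article's main, not its actual code: build_parser and selected_modes are hypothetical names introduced for illustration.

```python
import argparse
from typing import List, Optional

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="tree ensemble demo (sketch)")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    return parser

def selected_modes(argv: Optional[List[str]] = None) -> List[str]:
    """Return the modes a given command line would run, in execution order."""
    args = build_parser().parse_args(argv)   # argv=None falls back to sys.argv[1:]
    return [m for m in ("compare", "importance", "depth") if args.mode in (m, "all")]

print(selected_modes(["--mode", "depth"]))
print(selected_modes(["--mode", "all"]))
print(selected_modes([]))
```

Separating "which modes were chosen" from "run them" keeps the if-chain in main easy to test, and an empty argument list shows that the default behaves the same as --mode all.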

Core source (verbatim from the full source at the end):

def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")

Runnable demo (with mock data and print feedback filled in):

import argparse
import sys
import numpy as np
from typing import List, Tuple


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    X = np.zeros((3, 2))
    y = np.array([0, 1, 1])
    return X, y, ["feat_00", "feat_01"]


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print("compare：对比决策树、随机森林、梯度提升的交叉验证分数")


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print("importance：输出每个特征对分类的贡献")


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    print("depth：观察树越深是否越容易过拟合")


def nexdo_time() -> str:
    return "2026-04-18 10:50:00"

def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")

for mode in ["compare", "importance", "depth", "all"]:
    print(f"\n$ python3 41-tree-ensemble.py --mode {mode}")
    sys.argv = ["prog", "--mode", mode]
    main()

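Geek practice: full source and run

Now assemble the building blocks above and put the full code below into your editor. Run --mode compare first, then --mode importance and --mode depth, to observe model stability, feature contributions, and overfitting respectively.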

#!/usr/bin/env python3
"""
41-tree-ensemble.py
对比决策树/随机森林/梯度提升，ASCII特征重要性排名

用法：
  python 41-tree-ensemble.py --mode compare
  python 41-tree-ensemble.py --mode importance
  python 41-tree-ensemble.py --mode depth
  python 41-tree-ensemble.py --mode all
"""

import argparse
import time
from typing import List, Tuple

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    models = [
        ("决策树(depth=3)",  DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)",   DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)",  RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比（5折交叉验证）")


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "决策树":   DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_

    # 按随机森林重要性排序
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]

    rows = []
    for rank, idx in enumerate(order[:10]):
        bar_rf = "█" * int(rf_imp[idx] * 100)
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名（Top10）")

    # ASCII 条形图（随机森林）
    print("\n  随机森林特征重要性（Top10 ASCII条形图）")
    print(f"  {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f"  {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f"  {'─'*55}")


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """树深度 vs 过拟合"""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")


def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")


if __name__ == "__main__":
    main()

$ python3 41-tree-ensemble.py --mode compare
[2026-04-18 10:52:53] 三模型对比

=================================================================
  模型对比（5折交叉验证）
=================================================================
┌───────────────┬─────────┬────────┐
│ 模型            │ CV均值准确率 │ CV标准差  │
├───────────────┼─────────┼────────┤
│ 决策树(depth=3)  │ 0.7155  │ 0.0178 │
│ 决策树(depth=10) │ 0.7855  │ 0.0158 │
│ 决策树(无限制)      │ 0.7850  │ 0.0157 │
│ 随机森林(100棵)    │ 0.8700  │ 0.0185 │
│ 梯度提升(100轮)    │ 0.8360  │ 0.0178 │
└───────────────┴─────────┴────────┘

[2026-04-18 10:52:56] 完成

$ python3 41-tree-ensemble.py --mode depth
[2026-04-18 10:52:57] 树深度 vs 过拟合分析

=================================================================
  深度 vs 过拟合
=================================================================
┌──────┬────────┬────────┬─────────┬───────┐
│ 最大深度 │ 训练准确率  │ 测试准确率  │ 差值      │ 状态    │
├──────┼────────┼────────┼─────────┼───────┤
│ 1    │ 0.6212 │ 0.6275 │ -0.0062 │ ✓     │
│ 2    │ 0.6981 │ 0.7100 │ -0.0119 │ ✓     │
│ 3    │ 0.7444 │ 0.7400 │ 0.0044  │ ✓     │
│ 5    │ 0.7925 │ 0.7525 │ 0.0400  │ ✓     │
│ 8    │ 0.8869 │ 0.7725 │ 0.1144  │ ⚠ 过拟合 │
│ 10   │ 0.9319 │ 0.7675 │ 0.1644  │ ⚠ 过拟合 │
│ 15   │ 0.9806 │ 0.7700 │ 0.2106  │ ⚠ 过拟合 │
│ 无限制  │ 1.0000 │ 0.7650 │ 0.2350  │ ⚠ 过拟合 │
└──────┴────────┴────────┴─────────┴───────┘

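Summary

Module	What to remember
`make_data`	Generates stable, reproducible classification data with no external file dependency
`mode_compare`	Compares the generalization of a tree, a forest, and boosted trees with cross-validation
`mode_importance`	Uses feature importance to explain which input columns the model mainly relies on
`mode_depth`	Uses the gap between training and test accuracy to spot overfitting
`main`	Uses `--mode` to split the experiment into commands that run on their own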

⏱ NexDo Time (5 minutes)

Challenge: change the depth list in mode_depth() to [2, 4, 6, 8, 12, None], rerun --mode depth, and see which depth gives the highest test accuracy.

Don’t wait for next time, do it in the next moment.

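💡 Next up: with decision trees, random forests, and gradient boosting in hand, you have covered the classification side of supervised learning. Much real-world data, however, comes with no human labels (no y values). How do we find what the samples have in common and spot anomalies? The next article, "42 · Unsupervised Learning: K-Means Clustering and Anomaly Detection", moves into unsupervised learning: grouping data automatically with K-Means and using the Isolation Forest algorithm to pick hidden anomalous samples out of a large feature set.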