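41 · Classification Trees and Random Forests: Building a Decision System
🔗 Knowledge graph navigation: before reading this article, review the training/test split, classification metrics, and overfitting concepts in "40 · Linear and Logistic Regression: From Prediction to Binary Classification". This article carries "classification decisions" over to the more intuitive tree models.
Environment:
pip install numpy scikit-learn. All data in this article is generated locally by sklearn; no files need to be downloaded and no external service is required.
Pain point and architecture: logistic regression makes its decision with a single straight line, but many business rules cannot be captured by one line. A decision tree is a flowchart that asks one conditional question after another, and a random forest lets many such trees vote. The goal here is for you to run tree models end to end, read feature importances, and understand why a tree that is too deep overfits.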
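Build intuition for tree models first
Decision tree: one person asking questions by rule. Easy to explain, but prone to memorizing details of the training set.
Random forest: many trees vote. A single tree may be biased, but the collective result is more stable.
Gradient boosting: each new tree is fitted to correct the errors left by the trees before it. Accuracy is usually high, but it needs more tuning.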
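To make the "each tree corrects the errors before it" idea concrete, here is a minimal sketch that tracks test accuracy as boosting rounds accumulate. The dataset size, the 50 rounds, and the variable names are illustrative assumptions and not part of the article's script; staged_predict is the scikit-learn method that yields the ensemble's predictions after each round.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small synthetic dataset (sizes are illustrative assumptions).
X, y = make_classification(n_samples=600, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb.fit(X_train, y_train)

# staged_predict yields the prediction of the first 1, 2, ..., 50 trees.
stage_acc = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]

for n_trees in (1, 5, 10, 25, 50):
    print(f"{n_trees:>2} trees -> test accuracy {stage_acc[n_trees - 1]:.4f}")
```

Reading the printed curve from top to bottom shows how much each additional batch of trees contributes; the last entry matches what gb.predict would give.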
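Geek's take: a decision tree is like a customer-service routing script: "if age is under 30, look at income; if income is high, look at past behavior." A random forest is like an expert committee in which each expert sees slightly different data, and the final class is decided by vote.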
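The "routing script" picture can be printed directly. The sketch below is a minimal illustration, assuming a tiny 4-feature dataset and a depth-2 tree chosen only to keep the output short; it uses scikit-learn's export_text to dump the learned if/else rules as text.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny dataset so the printed rules stay readable (sizes are assumptions).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
feature_names = [f"feat_{i:02d}" for i in range(4)]

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Each "|---" line is one question; each "class:" line is a final answer.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Every path from the first line to a "class:" line is one complete rule, which is what makes a single shallow tree easy to explain.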
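Step by step: breaking down the core logic
This article is split into 6 steps: generate classification data, learn to print tables, then look at model comparison, feature importance, and tree-depth overfitting in turn, and finally use argparse to wire everything into a runnable script.
Step 1: Use make_data to generate a reproducible classification dataset
Pain point and mechanism:
make_data is this article's local data factory. It calls make_classification() to build 15 features, of which 8 are truly informative and 4 are redundant clues (linear combinations of the informative ones); the remaining 3 are pure noise. Think of it as an exam: some questions really separate skill levels, others just ask the same thing again. Fixing random_state=42 makes every run produce the same data, so beginners can compare results.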
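The two claims above (same seed gives the same data; redundant features repeat information already in the informative ones) can be checked with a short sketch. It assumes shuffle=False purely for illustration, so that the columns come out in the order informative, redundant, noise; make_data itself keeps the default shuffle=True, which mixes the columns.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(n_samples=2000, n_features=15, n_informative=8,
              n_redundant=4, random_state=42)

# 1) Same random_state, same data.
X1, y1 = make_classification(**params)
X2, y2 = make_classification(**params)
same = np.array_equal(X1, X2) and np.array_equal(y1, y2)
print("identical across runs:", same)

# 2) With shuffle=False (an assumption for this demo only) the first 8 columns
#    are informative and the next 4 are redundant.
Xs, ys = make_classification(shuffle=False, **params)
X_inf, X_red = Xs[:, :8], Xs[:, 8:12]

# Fit the redundant columns as a linear combination of the informative ones.
coef, *_ = np.linalg.lstsq(X_inf, X_red, rcond=None)
max_err = np.abs(X_inf @ coef - X_red).max()
print("max reconstruction error of redundant columns:", max_err)

# 3) Whatever is left over is noise.
n_noise = params["n_features"] - params["n_informative"] - params["n_redundant"]
print("noise features:", n_noise)
```

A reconstruction error near zero means the redundant columns add no new information, which is worth remembering when reading the feature-importance ranking later.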
Core source (excerpted from the full source at the end):
def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names
Runnable demo (with mock data and print feedback filled in):
from typing import List, Tuple

import numpy as np
from sklearn.datasets import make_classification


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names


X, y, feature_names = make_data()
print("数据形状 X:", X.shape)  # shape of X
print("标签形状 y:", y.shape)  # shape of y
print("前5个特征名:", feature_names[:5])  # first 5 feature names
print("类别分布:", dict(zip(*np.unique(y, return_counts=True))))  # class distribution
print("第一行前5个特征:", np.round(X[0, :5], 3).tolist())  # first 5 features of row 0
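Step 2: Use print_table to arrange model results into a terminal report card
Pain point and mechanism:
print_table lines up model names, accuracies, standard deviations, and similar fields. The worst thing in a machine-learning experiment is results scattered across the screen like a running log; a table works like a report card that shows at a glance which model is stable, which overfits, and which deserves more tuning.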
Core source (excerpted from the full source at the end):
def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n {title}\n{'='*65}")
    # Column width = longest header or cell in that column, measured with len()
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")
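Runnable demo (with mock data and print feedback filled in):
def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n {title}\n{'='*65}")
    # Column width = longest header or cell in that column, measured with len()
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


print_table(
    ["模型", "像什么", "优点"],  # model | what it is like | strength
    [
        ["决策树", "一张规则问答表", "解释清楚"],  # decision tree | a rule-based Q&A sheet | easy to explain
        ["随机森林", "一群树投票", "更稳"],  # random forest | a crowd of trees voting | more stable
        ["梯度提升", "错题本迭代", "精度常更高"],  # gradient boosting | iterating on a mistake notebook | often more accurate
    ],
    "树模型家族速记",  # title: tree-model family cheat sheet
)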
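One detail worth knowing when the table holds Chinese labels: print_table sizes columns with len(), which counts characters, while most terminals draw a CJK character two columns wide, so borders can look ragged. The sketch below is one possible workaround using the standard library's unicodedata; the helper names display_width and pad_display are hypothetical and not part of the article's script, and the width rule is a simplification (it ignores combining characters).

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal width: wide/fullwidth characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def pad_display(text: str, width: int) -> str:
    """Left-align text to a given display width by appending spaces."""
    return text + " " * max(0, width - display_width(text))


for label in ["决策树", "abc", "随机森林(100棵)"]:
    print(f"|{pad_display(label, 20)}| len={len(label)} width={display_width(label)}")
```

Swapping len(str(...)) for display_width(str(...)) and the `:<{w}` format for pad_display inside print_table would be the natural next step, at the cost of slightly longer code.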
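Step 3: Use mode_compare to compare a single tree, a random forest, and gradient boosting
Pain point and mechanism:
mode_compare measures generalization with 5-fold cross-validation. A single tree is one expert making the call, a random forest is a panel of experts voting, and gradient boosting keeps correcting its past mistakes. Cross-validation splits the exam into 5 parts and tests on each part in turn, so the result does not hinge on one lucky test split.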
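To see what "splitting the exam into 5 parts" means in code, the following sketch reproduces cross_val_score by hand. It assumes a small 400-sample dataset and a depth-3 tree for speed; for a classifier with an integer cv, scikit-learn uses StratifiedKFold without shuffling, so the manual loop should give the same five scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Manual 5-fold loop: train on 4 folds, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The one-liner used in mode_compare.
auto_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print("manual:", np.round(manual_scores, 4))
print("auto:  ", np.round(auto_scores, 4))
```

The mean of these five numbers is the "CV mean accuracy" column, and their standard deviation is the "CV std" column.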
Core source (excerpted from the full source at the end):
def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    # (display name, estimator): tree depth=3, tree depth=10, unrestricted tree,
    # random forest (100 trees), gradient boosting (100 rounds)
    models = [
        ("决策树(depth=3)", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)", DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        # 5-fold cross-validated accuracy for each model
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    # Columns: model, CV mean accuracy, CV standard deviation
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比(5折交叉验证)")
Runnable demo (with mock data and print feedback filled in):
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n {title}\n{'='*65}")
    # Column width = longest header or cell in that column, measured with len()
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


# Smaller mock dataset (400 samples) so the demo runs quickly
X, y = make_classification(n_samples=400, n_features=15, n_informative=8, n_redundant=4, random_state=42)


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    # (display name, estimator): tree depth=3, tree depth=10, unrestricted tree,
    # random forest (100 trees), gradient boosting (100 rounds)
    models = [
        ("决策树(depth=3)", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)", DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        # 5-fold cross-validated accuracy for each model
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    # Columns: model, CV mean accuracy, CV standard deviation
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比(5折交叉验证)")


mode_compare(X, y)
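Step 4: Use mode_importance to find the features the model relies on most
Pain point and mechanism:
feature_importances_ reports, for each feature, how much the splits on that feature reduced impurity, normalized so the values sum to 1; in other words, which features the model split on most productively. It is like a weight table of clues in a case: a random forest aggregates the opinions of many trees, so its ranking is usually more stable than a single tree's. The ASCII bar chart lets beginners read the ranking without plotting anything.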
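The "many trees' opinions" part can be inspected directly. As a minimal sketch, assuming a 500-sample dataset, the code below collects each tree's own feature_importances_ from rf.estimators_ and compares their average with the forest-level attribute; in scikit-learn's implementation, as far as I know, the two coincide whenever every tree has at least one split. The per-tree standard deviation gives a rough sense of how much individual trees disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
feature_names = [f"feat_{i:02d}" for i in range(15)]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# One row per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)

# Show the three features the forest ranks highest.
for idx in np.argsort(rf.feature_importances_)[::-1][:3]:
    print(f"{feature_names[idx]}: forest={rf.feature_importances_[idx]:.4f} "
          f"mean of trees={mean_imp[idx]:.4f} std across trees={std_imp[idx]:.4f}")
```

A large spread across trees for a feature is a hint that a single decision tree's ranking of that feature should not be over-interpreted.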
Core source (excerpted from the full source at the end):
def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Keys: decision tree, random forest, gradient boosting
    models = {
        "决策树": DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_
    # Sort features by random-forest importance, highest first
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]
    rows = []
    for rank, idx in enumerate(order[:10]):
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    # Columns: rank, feature, decision tree, random forest, gradient boosting
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名(Top10)")
    # ASCII bar chart (random forest): 3 blocks per 0.01 of importance
    print("\n 随机森林特征重要性(Top10 ASCII条形图)")
    print(f" {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f" {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f" {'─'*55}")
Runnable demo (with mock data and print feedback filled in):
import time
from typing import List

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n {title}\n{'='*65}")
    # Column width = longest header or cell in that column, measured with len()
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


# Smaller mock dataset (500 samples) so the demo runs quickly
X, y = make_classification(n_samples=500, n_features=15, n_informative=8, n_redundant=4, random_state=42)
feature_names = [f"feat_{i:02d}" for i in range(15)]


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Keys: decision tree, random forest, gradient boosting
    models = {
        "决策树": DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_
    # Sort features by random-forest importance, highest first
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]
    rows = []
    for rank, idx in enumerate(order[:10]):
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    # Columns: rank, feature, decision tree, random forest, gradient boosting
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名(Top10)")
    # ASCII bar chart (random forest): 3 blocks per 0.01 of importance
    print("\n 随机森林特征重要性(Top10 ASCII条形图)")
    print(f" {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f" {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f" {'─'*55}")


mode_importance(X, y, feature_names)
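Step 5: Use mode_depth to spot overfitting in tree models
Pain point and mechanism:
The deeper the tree, the finer the rules and the higher the training score tends to be, but the test score does not necessarily improve. It is like memorizing answers: full marks on the practice questions, exposed on a new exam. mode_depth prints training accuracy, test accuracy, and their difference, and flags overfitting when the difference exceeds 0.05.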
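"Memorizing answers" has a visible footprint in the tree's size. The sketch below, assuming a 500-sample dataset, fits one shallow and one unrestricted tree and reports depth and leaf count through get_depth() and get_n_leaves(); the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_redundant=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

for name, clf in [("max_depth=3", shallow), ("unrestricted", deep)]:
    print(f"{name:>12}: depth={clf.get_depth()} leaves={clf.get_n_leaves()} "
          f"train={clf.score(X_train, y_train):.4f} test={clf.score(X_test, y_test):.4f}")
```

An unrestricted tree keeps splitting until every leaf is pure, so it ends up with many leaves and perfect training accuracy; the gap to its test accuracy is the quantity mode_depth flags.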
Core source (excerpted from the full source at the end):
def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """Tree depth vs. overfitting."""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        # Flag overfitting when training accuracy beats test accuracy by more than 0.05
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        # max_depth=None means unlimited depth, shown as "无限制"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    # Columns: max depth, training accuracy, test accuracy, gap, status
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")
Runnable demo (with mock data and print feedback filled in):
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n {title}\n{'='*65}")
    # Column width = longest header or cell in that column, measured with len()
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


# Smaller mock dataset (500 samples) so the demo runs quickly
X, y = make_classification(n_samples=500, n_features=15, n_informative=8, n_redundant=4, random_state=42)


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """Tree depth vs. overfitting."""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        # Flag overfitting when training accuracy beats test accuracy by more than 0.05
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        # max_depth=None means unlimited depth, shown as "无限制"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    # Columns: max depth, training accuracy, test accuracy, gap, status
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")


mode_depth(X, y)
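Step 6: Use main to dispatch compare/importance/depth/all from the command line
Pain point and mechanism:
main is the script's remote control. Beginners do not need to edit code: changing the --mode argument runs the model comparison, feature importance, or depth-overfitting analysis separately, and all runs them in one go.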
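The dispatching behavior of --mode can be explored without touching sys.argv, because parse_args accepts an explicit argument list. The sketch below is a minimal illustration; build_parser is a hypothetical helper introduced here and does not exist in the article's script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: same --mode option as the article's main()."""
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    return parser


parser = build_parser()

# No arguments: the default applies.
default_mode = parser.parse_args([]).mode
# Explicit argument list instead of the real command line.
depth_mode = parser.parse_args(["--mode", "depth"]).mode

# A value outside `choices` makes argparse print an error and exit.
try:
    parser.parse_args(["--mode", "bogus"])
    rejected = False
except SystemExit:
    rejected = True

print(default_mode, depth_mode, rejected)
```

Because choices is enforced by argparse itself, main never needs its own check for an unknown mode.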
Core source (excerpted from the full source at the end):
def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    # "all" matches every branch, so the three analyses run in sequence
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")
Runnable demo (with mock data and print feedback filled in):
import argparse
import sys
from typing import List, Tuple

import numpy as np


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    # Tiny stand-in data; the stubs below only print what each mode would do
    X = np.zeros((3, 2))
    y = np.array([0, 1, 1])
    return X, y, ["feat_00", "feat_01"]


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    # Stub: compare cross-validation scores of tree, forest, and boosting
    print("compare:对比决策树、随机森林、梯度提升的交叉验证分数")


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    # Stub: print each feature's contribution to the classification
    print("importance:输出每个特征对分类的贡献")


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    # Stub: check whether deeper trees overfit more easily
    print("depth:观察树越深是否越容易过拟合")


def nexdo_time() -> str:
    return "2026-04-18 10:50:00"  # fixed timestamp so the demo output is stable


def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")


# Simulate four command-line invocations by overwriting sys.argv
for mode in ["compare", "importance", "depth", "all"]:
    print(f"\n$ python3 41-tree-ensemble.py --mode {mode}")
    sys.argv = ["prog", "--mode", mode]
    main()
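Geek in practice: full source and how to run it
Now put the building blocks together and paste the complete code below into your editor. Start with --mode compare, then run --mode importance and --mode depth to observe model stability, feature contributions, and overfitting respectively.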
#!/usr/bin/env python3
"""
41-tree-ensemble.py
Compare decision tree / random forest / gradient boosting, with an ASCII feature-importance ranking.
Usage:
    python 41-tree-ensemble.py --mode compare
    python 41-tree-ensemble.py --mode importance
    python 41-tree-ensemble.py --mode depth
"""
import argparse
import time
from typing import List, Tuple

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n {title}\n{'='*65}")
    # Column width = longest header or cell in that column, measured with len()
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")


def make_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    feature_names = [f"feat_{i:02d}" for i in range(15)]
    X, y = make_classification(
        n_samples=2000, n_features=15, n_informative=8,
        n_redundant=4, random_state=42
    )
    return X, y, feature_names


def mode_compare(X: np.ndarray, y: np.ndarray) -> None:
    print(f"[{nexdo_time()}] 三模型对比")
    # (display name, estimator): tree depth=3, tree depth=10, unrestricted tree,
    # random forest (100 trees), gradient boosting (100 rounds)
    models = [
        ("决策树(depth=3)", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("决策树(depth=10)", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("决策树(无限制)", DecisionTreeClassifier(random_state=42)),
        ("随机森林(100棵)", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("梯度提升(100轮)", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ]
    rows = []
    for name, clf in models:
        # 5-fold cross-validated accuracy for each model
        cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        rows.append([name, f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}"])
    # Columns: model, CV mean accuracy, CV standard deviation
    print_table(["模型", "CV均值准确率", "CV标准差"], rows, "模型对比(5折交叉验证)")


def mode_importance(X: np.ndarray, y: np.ndarray, feature_names: List[str]) -> None:
    print(f"[{nexdo_time()}] 特征重要性对比")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Keys: decision tree, random forest, gradient boosting
    models = {
        "决策树": DecisionTreeClassifier(max_depth=5, random_state=42),
        "随机森林": RandomForestClassifier(n_estimators=100, random_state=42),
        "梯度提升": GradientBoostingClassifier(n_estimators=100, random_state=42),
    }
    importances = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        importances[name] = clf.feature_importances_
    # Sort features by random-forest importance, highest first
    rf_imp = importances["随机森林"]
    order = np.argsort(rf_imp)[::-1]
    rows = []
    for rank, idx in enumerate(order[:10]):
        rows.append([
            rank + 1,
            feature_names[idx],
            f"{importances['决策树'][idx]:.4f}",
            f"{importances['随机森林'][idx]:.4f}",
            f"{importances['梯度提升'][idx]:.4f}",
        ])
    # Columns: rank, feature, decision tree, random forest, gradient boosting
    print_table(["排名", "特征", "决策树", "随机森林", "梯度提升"], rows, "特征重要性排名(Top10)")
    # ASCII bar chart (random forest): 3 blocks per 0.01 of importance
    print("\n 随机森林特征重要性(Top10 ASCII条形图)")
    print(f" {'─'*55}")
    for idx in order[:10]:
        bar_len = int(rf_imp[idx] * 300)
        bar = "█" * bar_len
        print(f" {feature_names[idx]:10s} │{bar:<30} {rf_imp[idx]:.4f}")
    print(f" {'─'*55}")


def mode_depth(X: np.ndarray, y: np.ndarray) -> None:
    """Tree depth vs. overfitting."""
    print(f"[{nexdo_time()}] 树深度 vs 过拟合分析")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rows = []
    for depth in [1, 2, 3, 5, 8, 10, 15, None]:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_train))
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        gap = train_acc - test_acc
        # Flag overfitting when training accuracy beats test accuracy by more than 0.05
        flag = "⚠ 过拟合" if gap > 0.05 else "✓"
        # max_depth=None means unlimited depth, shown as "无限制"
        rows.append([str(depth) if depth else "无限制",
                     f"{train_acc:.4f}", f"{test_acc:.4f}", f"{gap:.4f}", flag])
    # Columns: max depth, training accuracy, test accuracy, gap, status
    print_table(["最大深度", "训练准确率", "测试准确率", "差值", "状态"], rows, "深度 vs 过拟合")


def main() -> None:
    parser = argparse.ArgumentParser(description="决策树与集成模型演示")
    parser.add_argument("--mode", choices=["compare", "importance", "depth", "all"],
                        default="all")
    args = parser.parse_args()
    X, y, feature_names = make_data()
    # "all" matches every branch, so the three analyses run in sequence
    if args.mode in ("compare", "all"):
        mode_compare(X, y)
    if args.mode in ("importance", "all"):
        mode_importance(X, y, feature_names)
    if args.mode in ("depth", "all"):
        mode_depth(X, y)
    print(f"\n[{nexdo_time()}] 完成")


if __name__ == "__main__":
    main()
$ python3 41-tree-ensemble.py --mode compare
[2026-04-18 10:52:53] 三模型对比
=================================================================
模型对比(5折交叉验证)
=================================================================
┌───────────────┬─────────┬────────┐
│ 模型 │ CV均值准确率 │ CV标准差 │
├───────────────┼─────────┼────────┤
│ 决策树(depth=3) │ 0.7155 │ 0.0178 │
│ 决策树(depth=10) │ 0.7855 │ 0.0158 │
│ 决策树(无限制) │ 0.7850 │ 0.0157 │
│ 随机森林(100棵) │ 0.8700 │ 0.0185 │
│ 梯度提升(100轮) │ 0.8360 │ 0.0178 │
└───────────────┴─────────┴────────┘
[2026-04-18 10:52:56] 完成
$ python3 41-tree-ensemble.py --mode depth
[2026-04-18 10:52:57] 树深度 vs 过拟合分析
=================================================================
深度 vs 过拟合
=================================================================
┌──────┬────────┬────────┬─────────┬───────┐
│ 最大深度 │ 训练准确率 │ 测试准确率 │ 差值 │ 状态 │
├──────┼────────┼────────┼─────────┼───────┤
│ 1 │ 0.6212 │ 0.6275 │ -0.0062 │ ✓ │
│ 2 │ 0.6981 │ 0.7100 │ -0.0119 │ ✓ │
│ 3 │ 0.7444 │ 0.7400 │ 0.0044 │ ✓ │
│ 5 │ 0.7925 │ 0.7525 │ 0.0400 │ ✓ │
│ 8 │ 0.8869 │ 0.7725 │ 0.1144 │ ⚠ 过拟合 │
│ 10 │ 0.9319 │ 0.7675 │ 0.1644 │ ⚠ 过拟合 │
│ 15 │ 0.9806 │ 0.7700 │ 0.2106 │ ⚠ 过拟合 │
│ 无限制 │ 1.0000 │ 0.7650 │ 0.2350 │ ⚠ 过拟合 │
└──────┴────────┴────────┴─────────┴───────┘
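Summary
| Module | What to remember |
|---|---|
| make_data | Generates stable, reproducible classification data with no external file dependency |
| mode_compare | Uses cross-validation to compare the generalization of a tree, a forest, and boosted trees |
| mode_importance | Uses feature importance to explain which input columns the model mainly relies on |
| mode_depth | Uses the gap between training and test accuracy to spot overfitting |
| main | Uses --mode to split the experiment into commands that can be run separately |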
⏱ NexDo Time (5 minutes)
Challenge: change the depth list in mode_depth() to [2, 4, 6, 8, 12, None], rerun --mode depth, and see which depth gives the highest test accuracy.
Don’t wait for next time, do it in the next moment.