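49 · MNIST Handwritten Digit Recognition: A Fully Connected Network in Practice
🔗 Knowledge graph navigation: before reading this article, it helps to review MLPs, Softmax, cross-entropy, and mini-batch training in "48 · Neural Network Basics: Hand-Writing Forward and Backward Propagation from Scratch". This article applies that hand-written network to handwritten digit recognition.
NexDo Time · 2026-04-17 · Estimated reading time: 30 minutes
Pain Points and Architecture
Handwritten digit recognition is the classic entry-level deep learning task: give the model a small image and have it decide which digit from 0 to 9 is written in it. Many tutorials start by downloading MNIST and calling a framework's training routine, so readers see an accuracy figure without understanding how pixels, labels, and network outputs connect.
This article uses the digits dataset bundled with sklearn in place of the full MNIST. Each image is 8x8 pixels, there are 1797 samples in total, and nothing has to be downloaded, so it runs quickly on an ordinary computer. The flow is simple: flatten each image into a 64-dimensional vector, let the MLP output 10 probabilities, and take the index of the largest probability as the predicted digit.
8x8 image -> flatten to 64 dims -> MLP(64 -> 128 -> 64 -> 10)
    -> Softmax probabilities -> argmax -> predicted digit
    -> confusion matrix -> see which digits get confused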
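The two ends of that pipeline can be tried without any trained model. The following is a minimal sketch, not the article's network: the image is hand-made, and `fake_prob` is a placeholder name standing in for the MLP's Softmax output.

```python
import numpy as np

# A hand-made 8x8 "image"; pixel values in sklearn digits range from 0 to 16.
img = np.zeros((8, 8))
img[1:7, 3] = 16.0            # a vertical stroke, roughly a "1"

# Flatten to one row of 64 features, the input shape the MLP expects.
x = img.reshape(1, -1)
print("flattened shape:", x.shape)

# Placeholder probabilities (assumption: stands in for the Softmax output).
fake_prob = np.full((1, 10), 0.02)
fake_prob[0, 1] = 0.82        # 9 * 0.02 + 0.82 = 1.0
predicted = int(fake_prob.argmax(axis=1)[0])
print("predicted digit:", predicted)
```

Everything between `x` and `fake_prob` is what the following steps build.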
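Step by Step: Breaking Down the Core Logic
This article goes from "how pixels enter the network" to "how a confusion matrix exposes errors" in 8 small steps. Each step can be copied and run on its own: look at the output first, then work through the principle.
Step 1: Use onehot and softmax to turn digit labels into a trainable target
Pain point and mechanism:
Handwritten digit recognition is a 10-class problem. The label 7 is a single integer, so it cannot be lined up element by element with the 10 output probabilities (cross-entropy multiplies the target by the log-probabilities), which is why it is first converted to one-hot form. Think of one-hot as an answer sheet: the box for the correct digit is filled in and the others stay blank. Softmax then turns the model's 10 raw scores into probabilities, and cross-entropy measures how far the prediction is from that answer sheet.
Core source (taken from the full listing at the end of the article):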
def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh
Runnable demo (with mock data and print feedback added):
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


labels = np.array([3, 0, 9])
logits = np.array([
    [0.1, 0.2, 0.3, 3.5, 0.1, 0.0, -0.2, 0.4, 0.1, 0.2],
    [2.8, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.0],
    [0.1, 0.0, 0.2, 0.1, 0.3, 0.0, 0.2, 0.1, 0.4, 2.7],
])
truth = onehot(labels, 10)
prob = softmax(logits)
print("Labels:", labels.tolist())
print("First one-hot row:", truth[0].astype(int).tolist())
print("Row probability sums:", np.round(prob.sum(axis=1), 4).tolist())
print("Predicted digits:", prob.argmax(axis=1).tolist())
print("Cross-entropy loss:", round(cross_entropy(prob, truth), 4))
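One detail of `softmax` is easy to miss: it subtracts each row's maximum before exponentiating. The sketch below illustrates why, under the assumption of deliberately extreme logits; `softmax_naive` is a hypothetical variant written only for this comparison and is not part of the article's source.

```python
import numpy as np


def softmax_naive(z: np.ndarray) -> np.ndarray:
    # Hypothetical variant for comparison only: no max subtraction.
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def softmax(z: np.ndarray) -> np.ndarray:
    # Same formula as the article's softmax.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


big = np.array([[1000.0, 1001.0, 1002.0]])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)   # exp(1000) overflows to inf, inf / inf is nan
stable = softmax(big)            # works on [-2, -1, 0] instead

print("naive :", naive)
print("stable:", np.round(stable, 4))
```

Subtracting a per-row constant does not change the resulting probabilities, because the same factor cancels in the numerator and the denominator.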
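Step 2: Use the MLP initializer and forward to build the 64 → hidden → 10 path
Pain point and mechanism:
Each sklearn digits image is 8x8, so flattening it gives 64 pixels. An MLP's weight matrices work like rows of power strips: the first layer wires the 64 pixels to hidden neurons, and the last layer wires the hidden features to 10 digit probabilities. This step covers forward propagation only, to make the data flow clear first.
Core source (taken from the full listing at the end of the article):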
    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases: list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A
Runnable demo (with mock data and print feedback added):
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


class MLP:
    """Initialization and forward pass only, to show how 64 pixel values flow to 10 digits."""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases: list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A


X = np.zeros((2, 64))
X[0, 10:20] = 1.0
X[1, 30:40] = 1.0
model = MLP(layer_sizes=[64, 16, 10], lr=0.05)
prob = model.forward(X)
print("Number of layers:", len(model.weights))
print("W1 shape:", model.weights[0].shape)
print("W2 shape:", model.weights[1].shape)
print("Output probability shape:", prob.shape)
print("Predicted digits for the two samples:", prob.argmax(axis=1).tolist())
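The factor `np.sqrt(2.0 / n_in)` in `__init__` is the scaling commonly called He initialization. As a rough illustration of what it buys (a sketch on synthetic standard-normal inputs, which is an assumption and not the real digits data), compare the spread of the first layer's pre-activations with and without it:

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_out = 64, 128
X = rng.randn(1000, n_in)             # synthetic stand-in for standardized pixels

W_raw = rng.randn(n_in, n_out)        # unscaled standard-normal weights
W_he = W_raw * np.sqrt(2.0 / n_in)    # same scaling as MLP.__init__

z_raw = X @ W_raw
z_he = X @ W_he
print("std of pre-activations, unscaled:", round(float(z_raw.std()), 2))
print("std of pre-activations, scaled  :", round(float(z_he.std()), 2))
```

Without scaling, the spread grows to roughly sqrt(64) = 8 after a single layer; with scaling it stays near sqrt(2). ReLU then zeroes about half of those values, which is the usual argument for using 2 rather than 1 in the numerator.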
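Step 3: Use backward to send the error from the output layer back through the network
Pain point and mechanism:
If forward propagation is answering the question, backpropagation is correcting the answer. The model first learns which digit it got wrong, then carries that error back from the output layer through each hidden layer and works out in which direction every connection should be adjusted. Note that backward() here both computes the gradients and applies the update in the same pass. The demo compares the loss before and after a single parameter update, so you can see that "learning" really happens.
Core source (taken from the full listing at the end of the article):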
    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # Output-layer gradient
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db
Runnable demo (with mock data and print feedback added):
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


class MLP:
    """Multi-layer fully connected network; supports any number of layers."""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases: list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # Output-layer gradient
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)


X = np.eye(4, 64)[:4]
y = onehot(np.array([0, 1, 2, 3]), 10)
model = MLP(layer_sizes=[64, 12, 10], lr=0.1)
before = cross_entropy(model.forward(X), y)
model.backward(y)
after = cross_entropy(model.forward(X), y)
print("Loss before backpropagation:", round(before, 4))
print("Loss after backpropagation:", round(after, 4))
print("Cached layers:", len(model._cache))
print("Intuition: the error travels back from the output layer, and every layer's weights get a small nudge.")
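The starting point of `backward`, `dZ = (A_out - y_true) / n`, relies on the fact that Softmax followed by cross-entropy has the gradient "probabilities minus one-hot target" with respect to the logits. A gradient check is one way to convince yourself of this. The sketch below uses random logits and a central finite difference; the seed, step size, and labels are arbitrary choices made for this illustration.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


rng = np.random.RandomState(0)
logits = rng.randn(3, 10)
y = np.zeros((3, 10))
y[np.arange(3), [3, 0, 9]] = 1

# Analytic gradient, the same expression backward() starts from.
analytic = (softmax(logits) - y) / logits.shape[0]

# Central finite differences, perturbing one logit at a time.
eps = 1e-5
numeric = np.zeros_like(logits)
for r in range(logits.shape[0]):
    for c in range(logits.shape[1]):
        plus, minus = logits.copy(), logits.copy()
        plus[r, c] += eps
        minus[r, c] -= eps
        numeric[r, c] = (cross_entropy(softmax(plus), y)
                         - cross_entropy(softmax(minus), y)) / (2 * eps)

max_diff = float(np.abs(analytic - numeric).max())
print("max |analytic - numeric| =", max_diff)
```

The two gradients agree up to a very small difference; the remaining gap comes from the finite step and from the `1e-9` added inside the logarithm.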
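Step 4: Use fit to turn a single correction into multi-epoch training
Pain point and mechanism:
A single backward call changes the weights only slightly; real training repeats this for many epochs. In every epoch fit() first shuffles the data and then trains on it batch by batch, like a teacher splitting a big pile of homework into several stacks to grade. predict() takes the index of the largest final probability as the recognized digit.
Core source (taken from the full listing at the end of the article):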
    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)
Runnable demo (with mock data and print feedback added):
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


class MLP:
    """Multi-layer fully connected network; supports any number of layers."""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases: list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # Output-layer gradient
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)


rng = np.random.RandomState(42)
X = rng.randn(30, 64)
y_label = rng.randint(0, 10, size=30)
y = onehot(y_label, 10)
model = MLP(layer_sizes=[64, 20, 10], lr=0.05)
history = model.fit(X, y, epochs=8, batch_size=10)
print("Loss, first 4 epochs:", [round(v, 4) for v in history[:4]])
print("Loss, last 4 epochs:", [round(v, 4) for v in history[-4:]])
print("First 6 predictions:", model.predict(X[:6]).tolist())
print("First 6 true labels:", y_label[:6].tolist())
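To make the batching in `fit` concrete, the sketch below reproduces only its slicing arithmetic for the training set used later in the article (1437 samples, `batch_size=32`). `batch_bounds` is a hypothetical helper introduced for this illustration.

```python
def batch_bounds(n: int, batch_size: int) -> list[tuple[int, int]]:
    # Hypothetical helper: the (start, stop) pairs that
    # `for s in range(0, n, batch_size)` plus slice clipping produce.
    return [(s, min(s + batch_size, n)) for s in range(0, n, batch_size)]


bounds = batch_bounds(n=1437, batch_size=32)
sizes = [stop - start for start, stop in bounds]
print("batches per epoch:", len(sizes))
print("last batch size:", sizes[-1])
print("weight updates over 300 epochs:", len(sizes) * 300)
```

The last batch is shorter than the rest, which is why `fit` weights each batch loss by `len(Xb)` before dividing by `n`: the per-epoch value in `history` is then an average over samples rather than over batches.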
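Step 5: Use load_data to read sklearn's built-in handwritten digits and standardize them
Pain point and mechanism:
The real MNIST has to be downloaded, so this tutorial uses the small digits dataset that ships with sklearn: the images are smaller and training is faster, but the task is essentially the same. StandardScaler acts like a common brightness scale: it rescales each pixel feature to zero mean and unit variance, so that no feature "shouts louder" just because its values are larger. Note that load_data() fits the scaler on the full dataset before splitting, so test-set statistics leak slightly into preprocessing; this keeps the tutorial short, but a stricter pipeline would fit the scaler on the training split only.
Core source (taken from the full listing at the end of the article):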
def load_data() -> tuple:
    digits = load_digits()  # 8×8 pixels, 10 classes, 1797 samples
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)
Runnable demo (with mock data and print feedback added):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data() -> tuple:
    digits = load_digits()  # 8×8 pixels, 10 classes, 1797 samples
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)


X_tr, X_te, y_tr, y_te = load_data()
print("Training set:", X_tr.shape, "Test set:", X_te.shape)
print("Feature mean (approx.):", round(float(X_tr.mean()), 4))
print("Feature std (approx.):", round(float(X_tr.std()), 4))
print("First 10 labels:", y_tr[:10].tolist())
print("Note: digits is sklearn's built-in 8x8 handwritten digit set; no download is needed.")
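`load_data()` standardizes all samples first and splits afterwards. A stricter variant learns the mean and standard deviation from the training rows only and reuses them for the test rows. The sketch below shows that order of operations with plain NumPy on synthetic data; `fit_standardizer` is a hypothetical helper, and the random "pixels" are an assumption made for this illustration. With sklearn, the same idea would be calling `fit` on the training split and `transform` on both splits.

```python
import numpy as np


def fit_standardizer(X_train: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Hypothetical helper: learn the statistics from training rows only.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0        # constant columns: avoid division by zero
    return mean, std


rng = np.random.RandomState(42)
X = rng.uniform(0, 16, size=(100, 64))   # synthetic stand-in for pixel values
X[:, 0] = 0.0                            # one constant column, for illustration
idx = rng.permutation(100)
train_idx, test_idx = idx[:80], idx[80:]

mean, std = fit_standardizer(X[train_idx])
X_tr = (X[train_idx] - mean) / std
X_te = (X[test_idx] - mean) / std        # reuse the training statistics

print("train mean:", round(float(X_tr.mean()), 4))
print("test mean :", round(float(X_te.mean()), 4))
```

The test mean is close to zero but not exactly zero, which is expected: the test rows are measured on the training set's scale.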
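Step 6: Use mode_data to draw 8×8 digits as terminal character art
Pain point and mechanism:
Image data is hard to get a feel for when all you see is a string of 64 numbers. mode_data() converts pixel brightness into █ and ·, in effect drawing miniature digits in the terminal. Once you have seen the character art, it becomes clear that the model is simply judging digits from these light and dark cells.
Core source (taken from the full listing at the end of the article):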
def mode_data() -> None:
    print("\n" + "="*60 + "\n 数据集可视化(sklearn digits,8×8 像素)\n" + "="*60)
    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"\n 数据集: {X.shape[0]} 样本 特征维度: {X.shape[1]}(8×8展平)")
    print(f" 类别: {sorted(set(y))} 每类约 {len(y)//10} 个样本\n")
    # Render the first 10 digits as ASCII
    print(" 前10个样本的 ASCII 渲染(阈值=8):")
    for idx in range(10):
        img = digits.images[idx]  # (8, 8)
        label = y[idx]
        row_str = " ".join(
            "".join("█" if px > 8 else "·" for px in row)
            for row in img
        )
        # Print only row 4 as a thumbnail
        mid_row = "".join("█" if px > 8 else "·" for px in img[3])
        print(f" [{label}] {mid_row}")
    print("\n 完整渲染(数字 '3'):")
    idx_3 = np.where(y == 3)[0][0]
    for row in digits.images[idx_3]:
        print(" " + "".join("█" if px > 8 else "·" for px in row))
Runnable demo (with mock data and print feedback added):
import numpy as np
from sklearn.datasets import load_digits


def mode_data() -> None:
    print("\n" + "="*60 + "\n 数据集可视化(sklearn digits,8×8 像素)\n" + "="*60)
    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"\n 数据集: {X.shape[0]} 样本 特征维度: {X.shape[1]}(8×8展平)")
    print(f" 类别: {sorted(set(y))} 每类约 {len(y)//10} 个样本\n")
    # Render the first 10 digits as ASCII
    print(" 前10个样本的 ASCII 渲染(阈值=8):")
    for idx in range(10):
        img = digits.images[idx]  # (8, 8)
        label = y[idx]
        row_str = " ".join(
            "".join("█" if px > 8 else "·" for px in row)
            for row in img
        )
        # Print only row 4 as a thumbnail
        mid_row = "".join("█" if px > 8 else "·" for px in img[3])
        print(f" [{label}] {mid_row}")
    print("\n 完整渲染(数字 '3'):")
    idx_3 = np.where(y == 3)[0][0]
    for row in digits.images[idx_3]:
        print(" " + "".join("█" if px > 8 else "·" for px in row))


mode_data()
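The thresholding rule itself needs neither sklearn nor NumPy. The sketch below applies the same `px > 8` test to a hand-written 8x8 grid; the pixel values are made up for this illustration (on the dataset's 0 to 16 scale), and `render` is a hypothetical helper.

```python
def render(img: list[list[int]], threshold: int = 8) -> list[str]:
    # Hypothetical helper: one string per image row, using the same rule
    # as mode_data(): brighter than the threshold is "█", otherwise "·".
    return ["".join("█" if px > threshold else "·" for px in row) for row in img]


# Hand-made pixel values, loosely shaped like a "3".
img = [
    [0, 0, 0, 12, 13, 0, 0, 0],
    [0, 0, 11, 0, 14, 0, 0, 0],
    [0, 0, 0, 10, 16, 0, 0, 0],
    [0, 0, 0, 15, 9, 0, 0, 0],
    [0, 0, 0, 0, 16, 12, 0, 0],
    [0, 0, 0, 0, 7, 14, 0, 0],
    [0, 0, 0, 0, 8, 13, 10, 0],
    [0, 0, 0, 9, 16, 11, 0, 0],
]

rows = render(img)
for line in rows:
    print(line)
```

Because the comparison is strict, the pixels with values 7 and 8 in rows 6 and 7 are drawn as `·`. Lowering the threshold makes strokes thicker; raising it keeps only the brightest cells.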
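Step 7: Use mode_train to train the 64→128→64→10 fully connected network
Pain point and mechanism:
This step ties all the earlier building blocks together: load the data, one-hot encode the labels, create the MLP, train for 300 epochs, then print the loss curve and the test accuracy. The loss bar chart works like a thermometer, where a shorter bar means a lower training loss; the accuracy tells us how well the model recognizes digits it has not seen during training.
Core source (taken from the full listing at the end of the article):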
def mode_train() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n 训练全连接网络(64→128→64→10)\n" + "="*60)
    X_tr, X_te, y_tr, y_te = load_data()
    _test_data = (X_te, y_te)
    y_tr_oh = onehot(y_tr, 10)
    model = MLP(layer_sizes=[64, 128, 64, 10], lr=0.05)
    print(f"\n 网络: 64→128(ReLU)→64(ReLU)→10(Softmax)")
    print(f" 训练集: {len(X_tr)} 测试集: {len(X_te)} epochs=300\n")
    history = model.fit(X_tr, y_tr_oh, epochs=300, batch_size=32)
    _trained_model = model
    # Loss curve
    print(" 训练损失曲线(每60轮)")
    checkpoints = [0, 60, 120, 180, 240, 299]
    max_loss = history[0]
    W = 35
    for ep in checkpoints:
        loss = history[ep]
        bar = "█" * int(loss / max_loss * W)
        print(f" epoch {ep+1:>3} │{bar:<{W}}│ {loss:.4f}")
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"\n 测试准确率: {acc:.4f} ({acc*100:.1f}%)")
Runnable demo (with mock data and print feedback added):
from typing import Optional

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


class MLP:
    """Multi-layer fully connected network; supports any number of layers."""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases: list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # Output-layer gradient
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)


def load_data() -> tuple:
    digits = load_digits()  # 8×8 pixels, 10 classes, 1797 samples
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)


_trained_model: Optional[MLP] = None
_test_data: Optional[tuple] = None


def mode_train() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n 训练全连接网络(64→128→64→10)\n" + "="*60)
    X_tr, X_te, y_tr, y_te = load_data()
    _test_data = (X_te, y_te)
    y_tr_oh = onehot(y_tr, 10)
    model = MLP(layer_sizes=[64, 128, 64, 10], lr=0.05)
    print(f"\n 网络: 64→128(ReLU)→64(ReLU)→10(Softmax)")
    print(f" 训练集: {len(X_tr)} 测试集: {len(X_te)} epochs=300\n")
    history = model.fit(X_tr, y_tr_oh, epochs=300, batch_size=32)
    _trained_model = model
    # Loss curve
    print(" 训练损失曲线(每60轮)")
    checkpoints = [0, 60, 120, 180, 240, 299]
    max_loss = history[0]
    W = 35
    for ep in checkpoints:
        loss = history[ep]
        bar = "█" * int(loss / max_loss * W)
        print(f" epoch {ep+1:>3} │{bar:<{W}}│ {loss:.4f}")
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"\n 测试准确率: {acc:.4f} ({acc*100:.1f}%)")


mode_train()
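The bar length in `mode_train` is `int(loss / max_loss * W)`, which is linear in the loss. Once the loss drops by two or three orders of magnitude, every later bar rounds down to zero characters. The sketch below contrasts that formula with a logarithmic mapping; `log_bar` is a hypothetical alternative, not part of the article's source, and the loss values are copied from the sample run shown at the end of the article.

```python
import math


def linear_bar(loss: float, max_loss: float, width: int = 35) -> str:
    # Same formula as mode_train(): bar length proportional to the loss.
    return "█" * int(loss / max_loss * width)


def log_bar(loss: float, max_loss: float, min_loss: float, width: int = 35) -> str:
    # Hypothetical alternative: map log(loss) linearly onto the bar width.
    span = math.log(max_loss) - math.log(min_loss)
    frac = (math.log(loss) - math.log(min_loss)) / span
    return "█" * round(frac * width)


losses = [1.1440, 0.0031, 0.0013, 0.0008, 0.0005, 0.0004]  # from the sample run
for loss in losses:
    lin = len(linear_bar(loss, losses[0]))
    log = len(log_bar(loss, losses[0], losses[-1]))
    print(f"{loss:.4f}  linear={lin:>2}  log={log:>2}")
```

With the linear formula only the first checkpoint gets a visible bar; the logarithmic version keeps the later improvements visible, at the cost of a less literal scale.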
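Step 8: Use mode_eval to inspect per-digit accuracy and the confusion matrix
Pain point and mechanism:
Overall accuracy alone is not enough, because a model can be very good at recognizing 1 and still often mistake 9 for 8. A confusion matrix is like an error log sorted by category: rows are the true digits, columns are the predicted digits, and the larger an off-diagonal count, the more often that pair of classes is confused.
Core source (taken from the full listing at the end of the article):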
def mode_eval() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n 评估与混淆矩阵分析\n" + "="*60)
    if _trained_model is None:
        mode_train()
    X_te, y_te = _test_data
    y_pred = _trained_model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    cm = confusion_matrix(y_te, y_pred)
    # Per-class accuracy
    print(f"\n 整体准确率: {acc:.4f}\n")
    print(f" {'数字':<6} {'正确':<6} {'总数':<6} {'准确率':<8} 条形图")
    print(f" {'─'*50}")
    for digit in range(10):
        mask = y_te == digit
        correct = (y_pred[mask] == digit).sum()
        total = mask.sum()
        rate = correct / total if total > 0 else 0
        bar = "█" * int(rate * 20)
        print(f" {digit:<6} {correct:<6} {total:<6} {rate:.3f} {bar}")
    # ASCII confusion matrix (misclassifications highlighted in red)
    print(f"\n 混淆矩阵(行=真实,列=预测,只显示非零错误):")
    print(f" 真\\预 " + " ".join(f"{i:2}" for i in range(10)))
    print(f" {'─'*40}")
    for i in range(10):
        row_str = " ".join(
            f"\033[31m{cm[i,j]:2}\033[0m" if cm[i,j] > 0 and i != j
            else f"{cm[i,j]:2}"
            for j in range(10)
        )
        print(f" {i:2} {row_str}")
Runnable demo (with mock data and print feedback added):
from typing import Optional

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


class TinyModel:
    def predict(self, X: np.ndarray) -> np.ndarray:
        # Demo only: one sample is deliberately mispredicted so that a red
        # error entry shows up in the confusion matrix.
        return np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8])


_trained_model: Optional[TinyModel] = TinyModel()
_test_data: Optional[tuple] = (np.zeros((10, 64)), np.arange(10))


def mode_train() -> None:
    print("The demo environment already provides TinyModel; no retraining needed.")


def mode_eval() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n 评估与混淆矩阵分析\n" + "="*60)
    if _trained_model is None:
        mode_train()
    X_te, y_te = _test_data
    y_pred = _trained_model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    cm = confusion_matrix(y_te, y_pred)
    # Per-class accuracy
    print(f"\n 整体准确率: {acc:.4f}\n")
    print(f" {'数字':<6} {'正确':<6} {'总数':<6} {'准确率':<8} 条形图")
    print(f" {'─'*50}")
    for digit in range(10):
        mask = y_te == digit
        correct = (y_pred[mask] == digit).sum()
        total = mask.sum()
        rate = correct / total if total > 0 else 0
        bar = "█" * int(rate * 20)
        print(f" {digit:<6} {correct:<6} {total:<6} {rate:.3f} {bar}")
    # ASCII confusion matrix (misclassifications highlighted in red)
    print(f"\n 混淆矩阵(行=真实,列=预测,只显示非零错误):")
    print(f" 真\\预 " + " ".join(f"{i:2}" for i in range(10)))
    print(f" {'─'*40}")
    for i in range(10):
        row_str = " ".join(
            f"\033[31m{cm[i,j]:2}\033[0m" if cm[i,j] > 0 and i != j
            else f"{cm[i,j]:2}"
            for j in range(10)
        )
        print(f" {i:2} {row_str}")


mode_eval()
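To see what `confusion_matrix` is counting, the matrix can also be built by hand. The sketch below does so with NumPy on ten made-up labels (both arrays are invented for this illustration) and reads two numbers off it. The per-digit "accuracy" printed by `mode_eval` corresponds to the row-wise ratio, usually called recall; the column-wise ratio, precision, answers a different question.

```python
import numpy as np

# Made-up labels: two of the three true 9s are predicted as 8.
y_true = np.array([0, 1, 2, 3, 3, 8, 8, 9, 9, 9])
y_pred = np.array([0, 1, 2, 3, 5, 8, 8, 9, 8, 8])

cm = np.zeros((10, 10), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)       # row = true digit, column = predicted

row_sums = cm.sum(axis=1)
col_sums = cm.sum(axis=0)
recall_9 = cm[9, 9] / row_sums[9]        # of all true 9s, how many were found
precision_8 = cm[8, 8] / col_sums[8]     # of all predicted 8s, how many were right

print("true 9 predicted as 8:", cm[9, 8])
print("recall for 9:", round(float(recall_9), 3))
print("precision for 8:", round(float(precision_8), 3))
```

Here every true 8 is recognized, yet only half of the predicted 8s are correct, which a per-row summary alone would not reveal.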
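Geek Practice: Full Source and Running It
Now put the building blocks together and save the complete code below as 49-python-mnist-mlp.py. It uses sklearn's built-in digits dataset, so with no external images to download it covers data visualization, MLP training, and confusion-matrix evaluation.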
#!/usr/bin/env python3
"""
49-python-mnist-mlp.py: MNIST-style handwritten digit recognition

Usage:
    python3 49-python-mnist-mlp.py --mode data    # dataset visualization
    python3 49-python-mnist-mlp.py --mode train   # train the fully connected network
    python3 49-python-mnist-mlp.py --mode eval    # evaluation and confusion matrix
    python3 49-python-mnist-mlp.py --mode all     # everything (default)

Requires numpy and scikit-learn; run it directly.
Uses sklearn's built-in digits dataset (8×8 pixels, 1797 samples),
the same kind of problem as MNIST (28×28, 70000 samples), with no download required.
"""
import argparse
from typing import Optional

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]


def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


class MLP:
    """Multi-layer fully connected network; supports any number of layers."""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases: list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # Output-layer gradient
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)


# ─── Data loading ──────────────────────────────────────────────────────────────
def load_data() -> tuple:
    digits = load_digits()  # 8×8 pixels, 10 classes, 1797 samples
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)


# ─── Mode 1: data visualization ────────────────────────────────────────────────
def mode_data() -> None:
    print("\n" + "="*60 + "\n 数据集可视化(sklearn digits,8×8 像素)\n" + "="*60)
    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"\n 数据集: {X.shape[0]} 样本 特征维度: {X.shape[1]}(8×8展平)")
    print(f" 类别: {sorted(set(y))} 每类约 {len(y)//10} 个样本\n")
    # Render the first 10 digits as ASCII
    print(" 前10个样本的 ASCII 渲染(阈值=8):")
    for idx in range(10):
        img = digits.images[idx]  # (8, 8)
        label = y[idx]
        row_str = " ".join(
            "".join("█" if px > 8 else "·" for px in row)
            for row in img
        )
        # Print only row 4 as a thumbnail
        mid_row = "".join("█" if px > 8 else "·" for px in img[3])
        print(f" [{label}] {mid_row}")
    print("\n 完整渲染(数字 '3'):")
    idx_3 = np.where(y == 3)[0][0]
    for row in digits.images[idx_3]:
        print(" " + "".join("█" if px > 8 else "·" for px in row))


# ─── Mode 2: training ──────────────────────────────────────────────────────────
_trained_model: Optional[MLP] = None
_test_data: Optional[tuple] = None


def mode_train() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n 训练全连接网络(64→128→64→10)\n" + "="*60)
    X_tr, X_te, y_tr, y_te = load_data()
    _test_data = (X_te, y_te)
    y_tr_oh = onehot(y_tr, 10)
    model = MLP(layer_sizes=[64, 128, 64, 10], lr=0.05)
    print(f"\n 网络: 64→128(ReLU)→64(ReLU)→10(Softmax)")
    print(f" 训练集: {len(X_tr)} 测试集: {len(X_te)} epochs=300\n")
    history = model.fit(X_tr, y_tr_oh, epochs=300, batch_size=32)
    _trained_model = model
    # Loss curve
    print(" 训练损失曲线(每60轮)")
    checkpoints = [0, 60, 120, 180, 240, 299]
    max_loss = history[0]
    W = 35
    for ep in checkpoints:
        loss = history[ep]
        bar = "█" * int(loss / max_loss * W)
        print(f" epoch {ep+1:>3} │{bar:<{W}}│ {loss:.4f}")
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"\n 测试准确率: {acc:.4f} ({acc*100:.1f}%)")


# ─── Mode 3: evaluation and confusion matrix ───────────────────────────────────
def mode_eval() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n 评估与混淆矩阵分析\n" + "="*60)
    if _trained_model is None:
        mode_train()
    X_te, y_te = _test_data
    y_pred = _trained_model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    cm = confusion_matrix(y_te, y_pred)
    # Per-class accuracy
    print(f"\n 整体准确率: {acc:.4f}\n")
    print(f" {'数字':<6} {'正确':<6} {'总数':<6} {'准确率':<8} 条形图")
    print(f" {'─'*50}")
    for digit in range(10):
        mask = y_te == digit
        correct = (y_pred[mask] == digit).sum()
        total = mask.sum()
        rate = correct / total if total > 0 else 0
        bar = "█" * int(rate * 20)
        print(f" {digit:<6} {correct:<6} {total:<6} {rate:.3f} {bar}")
    # ASCII confusion matrix (misclassifications highlighted in red)
    print(f"\n 混淆矩阵(行=真实,列=预测,只显示非零错误):")
    print(f" 真\\预 " + " ".join(f"{i:2}" for i in range(10)))
    print(f" {'─'*40}")
    for i in range(10):
        row_str = " ".join(
            f"\033[31m{cm[i,j]:2}\033[0m" if cm[i,j] > 0 and i != j
            else f"{cm[i,j]:2}"
            for j in range(10)
        )
        print(f" {i:2} {row_str}")


# ─── Entry point ───────────────────────────────────────────────────────────────
def main() -> None:
    parser = argparse.ArgumentParser(description="MNIST 风格手写数字识别")
    parser.add_argument(
        "--mode",
        choices=["data", "train", "eval", "all"],
        default="all",
    )
    args = parser.parse_args()
    dispatch = {
        "data": mode_data,
        "train": mode_train,
        "eval": mode_eval,
        "all": lambda: [mode_data(), mode_train(), mode_eval()],
    }
    dispatch[args.mode]()


if __name__ == "__main__":
    main()
$ python 49-python-mnist-mlp.py --mode data
============================================================
数据集可视化(sklearn digits,8×8 像素)
============================================================
数据集: 1797 样本 特征维度: 64(8×8展平)
类别: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 每类约 179 个样本
前10个样本的 ASCII 渲染(阈值=8):
[0] ··█·····
[1] ··███···
[2] ····██··
[3] ···██···
[4] ···█·█··
[5] ··███···
[6] ··██····
[7] ····██··
[8] ···███··
[9] ··█·██··
完整渲染(数字 '3'):
···██···
··█·█···
···██···
···██···
····██··
·····█··
·····██·
···███··
$ python 49-python-mnist-mlp.py --mode eval
============================================================
评估与混淆矩阵分析
============================================================
============================================================
训练全连接网络(64→128→64→10)
============================================================
网络: 64→128(ReLU)→64(ReLU)→10(Softmax)
训练集: 1437 测试集: 360 epochs=300
训练损失曲线(每60轮)
epoch   1 │███████████████████████████████████│ 1.1440
epoch  61 │                                   │ 0.0031
epoch 121 │                                   │ 0.0013
epoch 181 │                                   │ 0.0008
epoch 241 │                                   │ 0.0005
epoch 300 │                                   │ 0.0004
测试准确率: 0.9833 (98.3%)
整体准确率: 0.9833
数字     正确     总数     准确率      条形图
──────────────────────────────────────────────────
0      33     33     1.000 ████████████████████
1      28     28     1.000 ████████████████████
2      33     33     1.000 ████████████████████
3      32     34     0.941 ██████████████████
4      46     46     1.000 ████████████████████
5      46     47     0.979 ███████████████████
6      34     35     0.971 ███████████████████
7      33     34     0.971 ███████████████████
8      30     30     1.000 ████████████████████
9      39     40     0.975 ███████████████████
混淆矩阵(行=真实,列=预测,只显示非零错误):
 真\预  0  1  2  3  4  5  6  7  8  9
────────────────────────────────────────
  0 33  0  0  0  0  0  0  0  0  0
  1  0 28  0  0  0  0  0  0  0  0
  2  0  0 33  0  0  0  0  0  0  0
  3  0  0  1 32  0  1  0  0  0  0
  4  0  0  0  0 46  0  0  0  0  0
  5  0  0  0  0  0 46  1  0  0  0
  6  0  0  0  0  0  1 34  0  0  0
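Summary and NexDo Time ⚡
In this article you completed a full image-classification loop: turning 8x8 pixels into 64-dimensional features, turning digit labels into one-hot vectors, turning MLP outputs into probabilities over 10 classes, and using a confusion matrix to check where the model actually goes wrong.
5-minute micro-challenge: change MLP(layer_sizes=[64, 128, 64, 10], lr=0.05) to [64, 32, 10] and then to [64, 256, 128, 10], run --mode eval for each, and compare accuracy and training time.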
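Before timing the three variants, it can help to know how large each one is. The sketch below counts trainable parameters for the layer sizes named in the challenge; `count_params` is a hypothetical helper that mirrors how `MLP.__init__` allocates one weight matrix and one bias row per layer.

```python
def count_params(layer_sizes: list[int]) -> int:
    # Each layer holds an (n_in x n_out) weight matrix plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


for sizes in ([64, 32, 10], [64, 128, 64, 10], [64, 256, 128, 10]):
    print(sizes, "->", count_params(sizes), "parameters")
```

The largest variant has about 21 times as many parameters as the smallest, so a difference in training time is to be expected; whether accuracy follows is what the challenge asks you to measure.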
Don’t wait for next time, do it in the next moment.