
49 · MNIST Handwritten Digit Recognition: A Fully Connected Network in Practice

#030 · 2026-04-16 · Python

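🔗 Knowledge-graph navigation: before reading this article, it helps to review MLP, Softmax, cross-entropy, and mini-batch training in "48 · Neural Network Basics: Hand-Writing Forward and Backward Propagation from Scratch". This article applies that hand-written network to the handwritten digit recognition task.

NexDo Time · 2026-04-17 · Estimated reading time: 30 minutes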

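Pain Points and Architecture

Handwritten digit recognition is the classic entry-level task in deep learning: give the model a small image and have it decide which digit from 0 to 9 is written in it. Many tutorials start by downloading MNIST and calling a framework to train, so readers see only the accuracy and never learn how pixels, labels, and network outputs connect.

This article uses the digits dataset built into sklearn in place of full MNIST. Each image is 8x8 pixels and there are 1797 samples in total. Nothing has to be downloaded, so it runs quickly on an ordinary computer. The flow is simple: flatten the image into a 64-dimensional vector, let the MLP output 10 probabilities, and take the index of the largest probability as the predicted digit.

8x8 image -> flatten to 64 dims -> MLP(64 -> 128 -> 64 -> 10)
      -> Softmax probabilities -> argmax -> predicted digit
      -> confusion matrix -> see which digits are easily confused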
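
As a rough illustration of the first and last arrows in the flow above, the sketch below flattens a batch of 8x8 arrays into 64-dimensional rows and picks a prediction with argmax. The pixel values and the probability rows are made up for illustration; they come neither from the digits dataset nor from a trained model.

```python
import numpy as np

# Hypothetical batch of 3 "images", each 8x8, filled with made-up values.
images = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)

# Flatten: every 8x8 grid becomes one row of 64 features.
X = images.reshape(len(images), -1)
print("image batch shape:", images.shape)   # (3, 8, 8)
print("flattened shape:  ", X.shape)        # (3, 64)

# Made-up probability rows standing in for the MLP's Softmax output.
prob = np.full((3, 10), 0.05)
prob[0, 7] = 0.55   # sample 0 leans towards digit 7
prob[1, 2] = 0.55   # sample 1 leans towards digit 2
prob[2, 9] = 0.55   # sample 2 leans towards digit 9

# argmax along each row gives the predicted digit.
pred = prob.argmax(axis=1)
print("predicted digits:", pred.tolist())   # [7, 2, 9]
```

Only the shapes matter here: whatever the pixel values are, the network always receives rows of 64 numbers and returns rows of 10.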

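Step by Step: Breaking Down the Core Logic

This article goes from "how pixels enter the network" to "how a confusion matrix explains the errors", split into 8 small steps. Each step can be copied and run on its own: see the output first, then understand the principle.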

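Step 1: Use onehot and softmax to turn digit labels into trainable targets

Pain Point and Mechanism

Handwritten digit recognition is a 10-class problem. The label 7 cannot be multiplied directly against 10 output probabilities, so it is first converted to one-hot. Think of one-hot as an answer sheet: the box for the correct digit is filled in and the rest stay empty. Softmax then turns the model's 10 raw scores into probabilities, and cross-entropy measures "how badly wrong this answer sheet is".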
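
To make the "answer sheet" idea concrete, here is a minimal sketch with one sample and only 3 classes instead of 10. The probabilities are made up for illustration. It shows that multiplying by the one-hot row leaves only the negative log of the probability assigned to the correct class.

```python
import numpy as np

# Made-up probabilities for one sample over 3 classes (not model output).
prob = np.array([[0.7, 0.2, 0.1]])
truth = np.array([[1.0, 0.0, 0.0]])   # one-hot: the correct class is 0

# Multiplying by the one-hot row zeroes out every wrong class,
# so only -log(probability of the correct class) remains.
loss_full = -np.sum(truth * np.log(prob)) / len(prob)
loss_short = -np.log(prob[0, 0])
print(round(float(loss_full), 4), round(float(loss_short), 4))

# A confident wrong answer is punished far more than a confident right one.
wrong = np.array([[0.1, 0.2, 0.7]])
loss_wrong = -np.sum(truth * np.log(wrong)) / len(wrong)
print(round(float(loss_wrong), 4))
```

The second loss is several times larger than the first, which is the sense in which cross-entropy measures how far off the answer sheet is.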

Core source (verbatim from the full source at the end of the article)

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh

Runnable demo (with mock data and print feedback filled in)

import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh

labels = np.array([3, 0, 9])
logits = np.array([
    [0.1, 0.2, 0.3, 3.5, 0.1, 0.0, -0.2, 0.4, 0.1, 0.2],
    [2.8, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.0],
    [0.1, 0.0, 0.2, 0.1, 0.3, 0.0, 0.2, 0.1, 0.4, 2.7],
])
truth = onehot(labels, 10)
prob = softmax(logits)

print("标签:", labels.tolist())
print("one-hot 第一行:", truth[0].astype(int).tolist())
print("每行概率和:", np.round(prob.sum(axis=1), 4).tolist())
print("预测数字:", prob.argmax(axis=1).tolist())
print("交叉熵损失:", round(cross_entropy(prob, truth), 4))

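Step 2: Use MLP initialization and forward to build the 64→hidden→10 path

Pain Point and Mechanism

Each sklearn digits image is 8x8, so flattening gives 64 pixels. The MLP's weight matrices are like rows of patch panels: the first layer wires the 64 pixels to the hidden neurons, and the last layer wires the hidden features to the 10 digit probabilities. This step does only the forward pass, so that the data flow is clear before anything is trained.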
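
One detail of the initializer is easy to miss: __init__ scales standard-normal weights by sqrt(2 / n_in), a scheme commonly known as He initialization. The sketch below only checks that scale empirically for a single 64→128 layer; the layer size and seed are chosen for illustration.

```python
import numpy as np

n_in, n_out = 64, 128
rng = np.random.RandomState(42)

# Same recipe as the article's __init__: standard normal times sqrt(2 / n_in).
W = rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

target_std = np.sqrt(2.0 / n_in)
print("shape:       ", W.shape)
print("target std:  ", round(float(target_std), 4))
print("measured std:", round(float(W.std()), 4))
```

The measured spread should land close to the target, so layers with more inputs start with proportionally smaller weights.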

Core source (verbatim from the full source at the end of the article)

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases:  list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

Runnable demo (with mock data and print feedback filled in)

import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh

class MLP:
    """只保留初始化和前向传播,先看 64 维像素如何流向 10 个数字。"""
    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases:  list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

X = np.zeros((2, 64))
X[0, 10:20] = 1.0
X[1, 30:40] = 1.0
model = MLP(layer_sizes=[64, 16, 10], lr=0.05)
prob = model.forward(X)

print("层数:", len(model.weights))
print("W1 形状:", model.weights[0].shape)
print("W2 形状:", model.weights[1].shape)
print("输出概率形状:", prob.shape)
print("两个样本的预测数字:", prob.argmax(axis=1).tolist())

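Step 3: Use backward to propagate the error back from the output layer

Pain Point and Mechanism

If the forward pass is answering the question, backpropagation is correcting it. The model first learns which digits it guessed wrong, then passes the error back layer by layer from the output to the hidden layers, computing which direction each connection should be adjusted. The demo compares the loss before and after one parameter update so you can see that "learning" really happens.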
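
The output-layer gradient in backward() is (probabilities - one-hot) / n. One way to convince yourself of that formula is a numerical gradient check, sketched below on made-up scores and labels. Note one assumption: loss_fn here omits the small 1e-9 guard that the article's cross_entropy adds inside the log, so the comparison stays exact up to floating-point error.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_fn(z, y):
    # Mean cross-entropy over the batch, computed from raw scores z.
    return -np.sum(y * np.log(softmax(z))) / y.shape[0]

rng = np.random.RandomState(0)
z = rng.randn(4, 10)                      # made-up raw scores, 4 samples
y = np.zeros((4, 10))
y[np.arange(4), [3, 0, 9, 5]] = 1.0       # made-up one-hot labels

# Analytic gradient, as used in backward(): (probabilities - one-hot) / n
analytic = (softmax(z) - y) / y.shape[0]

# Numerical gradient by central differences, one score at a time.
numeric = np.zeros_like(z)
h = 1e-5
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i, j] += h
        z_minus[i, j] -= h
        numeric[i, j] = (loss_fn(z_plus, y) - loss_fn(z_minus, y)) / (2 * h)

max_diff = float(np.abs(analytic - numeric).max())
print("max |analytic - numeric|:", max_diff)
```

If the two gradients agree to many decimal places, the one-line output-layer formula can be trusted as the starting point for the rest of the backward pass.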

Core source (verbatim from the full source at the end of the article)

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # 输出层梯度
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z    = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i]  -= self.lr * db

Runnable demo (with mock data and print feedback filled in)

import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh

class MLP:
    """多层全连接网络,支持任意层数。"""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases:  list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # 输出层梯度
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z    = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i]  -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)

X = np.eye(4, 64)[:4]
y = onehot(np.array([0, 1, 2, 3]), 10)
model = MLP(layer_sizes=[64, 12, 10], lr=0.1)

before = cross_entropy(model.forward(X), y)
model.backward(y)
after = cross_entropy(model.forward(X), y)

print("反向传播前损失:", round(before, 4))
print("反向传播后损失:", round(after, 4))
print("缓存层数:", len(model._cache))
print("直觉:错误从输出层往回走,每层权重都被轻轻拧了一下。")

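Step 4: Use fit to turn a single correction into multi-epoch training

Pain Point and Mechanism

A single backward call changes the weights only slightly; real training repeats this for many epochs. Each epoch, fit() shuffles the data and then trains on it batch by batch, like a teacher splitting a big pile of homework into several stacks to grade. predict() takes the index of the largest final probability as the recognized digit.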
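
The batching arithmetic inside fit() can be looked at in isolation. The sketch below uses 10 made-up per-sample losses and a batch size of 4; it shows that the last batch is shorter and why each batch mean is multiplied by the batch length before dividing by n.

```python
import numpy as np

n, batch_size = 10, 4
rng = np.random.RandomState(0)
per_sample_loss = rng.rand(n)             # made-up per-sample losses

idx = rng.permutation(n)                  # shuffle once per epoch
shuffled = per_sample_loss[idx]

sizes, total = [], 0.0
for s in range(0, n, batch_size):
    batch = shuffled[s:s + batch_size]    # the last batch may be shorter
    sizes.append(len(batch))
    total += batch.mean() * len(batch)    # weight the batch mean by its size

epoch_loss = total / n
matches_mean = bool(np.isclose(epoch_loss, per_sample_loss.mean()))
print("batch sizes:", sizes)              # [4, 4, 2]
print("weighted epoch loss equals overall mean:", matches_mean)
```

In fit() itself the weights change between batches, so the recorded value is a running average over the epoch rather than the loss of the final weights.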

Core source (verbatim from the full source at the end of the article)

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)

Runnable demo (with mock data and print feedback filled in)

import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh

class MLP:
    """多层全连接网络,支持任意层数。"""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases:  list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # 输出层梯度
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z    = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i]  -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)

rng = np.random.RandomState(42)
X = rng.randn(30, 64)
y_label = rng.randint(0, 10, size=30)
y = onehot(y_label, 10)
model = MLP(layer_sizes=[64, 20, 10], lr=0.05)
history = model.fit(X, y, epochs=8, batch_size=10)

print("前 4 轮损失:", [round(v, 4) for v in history[:4]])
print("后 4 轮损失:", [round(v, 4) for v in history[-4:]])
print("预测前 6 个:", model.predict(X[:6]).tolist())
print("真实前 6 个:", y_label[:6].tolist())

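Step 5: Use load_data to read sklearn's built-in handwritten digits and standardize them

Pain Point and Mechanism

Real MNIST has to be downloaded, so this tutorial uses the small digits dataset that ships with sklearn: smaller shapes and faster runs, but essentially the same task. StandardScaler works like a shared brightness scale, bringing the pixel features into a range better suited to training so that no feature is "too loud" simply because its values are large.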
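
Note that load_data() fits the scaler on all 1797 samples before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to split first and take the statistics from the training rows only. The sketch below shows that variant with plain NumPy on made-up data. The helper name standardize_with_train_stats, the array sizes, and the constant column are assumptions for illustration; the arithmetic mimics StandardScaler's default behaviour (population standard deviation, zero-variance columns left unscaled).

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Hypothetical helper: scale both sets with the training mean and std."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0          # constant columns: avoid dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.RandomState(42)
X = rng.randint(0, 17, size=(100, 64)).astype(float)   # made-up 0..16 "pixels"
X[:, 0] = 0.0                                          # a pixel that is always blank

X_tr, X_te = X[:80], X[80:]                            # split first
X_tr_s, X_te_s = standardize_with_train_stats(X_tr, X_te)

mean_ok = bool(np.allclose(X_tr_s.mean(axis=0), 0.0))
std_ok = bool(np.allclose(X_tr_s[:, 1:].std(axis=0), 1.0))
const_ok = bool(np.all(X_tr_s[:, 0] == 0.0))
print("train mean ~ 0:", mean_ok)
print("train std of varying columns ~ 1:", std_ok)
print("constant column stays 0:", const_ok)
```

The test rows are transformed with the training statistics, so their mean and standard deviation will be close to, but not exactly, 0 and 1.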

Core source (verbatim from the full source at the end of the article)

def load_data() -> tuple:
    digits = load_digits()          # 8×8 像素,10类,1797样本
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)

Runnable demo (with mock data and print feedback filled in)

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data() -> tuple:
    digits = load_digits()          # 8×8 像素,10类,1797样本
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = load_data()
print("训练集:", X_tr.shape, "测试集:", X_te.shape)
print("特征均值约为:", round(float(X_tr.mean()), 4))
print("特征标准差约为:", round(float(X_tr.std()), 4))
print("前 10 个标签:", y_tr[:10].tolist())
print("说明:digits 是 sklearn 内置 8x8 手写数字,不需要联网下载。")

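Step 6: Use mode_data to draw 8×8 digits as terminal character art

Pain Point and Mechanism

If image data is viewed only as a string of 64 numbers, beginners struggle to build intuition. mode_data() turns each pixel into █ when its brightness is above the threshold of 8 and into · otherwise, in effect drawing miniature digits in the terminal. After seeing the character art, it becomes clear that the model is simply judging the digit from these light and dark cells.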

Core source (verbatim from the full source at the end of the article)

def mode_data() -> None:
    print("\n" + "="*60 + "\n  数据集可视化(sklearn digits,8×8 像素)\n" + "="*60)
    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"\n  数据集: {X.shape[0]} 样本  特征维度: {X.shape[1]}(8×8展平)")
    print(f"  类别: {sorted(set(y))}  每类约 {len(y)//10} 个样本\n")

    # ASCII 渲染前10个数字
    print("  前10个样本的 ASCII 渲染(阈值=8):")
    for idx in range(10):
        img = digits.images[idx]   # (8, 8)
        label = y[idx]
        row_str = "  ".join(
            "".join("█" if px > 8 else "·" for px in row)
            for row in img
        )
        # 只打印第4行作为缩略
        mid_row = "".join("█" if px > 8 else "·" for px in img[3])
        print(f"  [{label}] {mid_row}")

    print("\n  完整渲染(数字 '3'):")
    idx_3 = np.where(y == 3)[0][0]
    for row in digits.images[idx_3]:
        print("  " + "".join("█" if px > 8 else "·" for px in row))

Runnable demo (with mock data and print feedback filled in)

import numpy as np
from sklearn.datasets import load_digits

def mode_data() -> None:
    print("\n" + "="*60 + "\n  数据集可视化(sklearn digits,8×8 像素)\n" + "="*60)
    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"\n  数据集: {X.shape[0]} 样本  特征维度: {X.shape[1]}(8×8展平)")
    print(f"  类别: {sorted(set(y))}  每类约 {len(y)//10} 个样本\n")

    # ASCII 渲染前10个数字
    print("  前10个样本的 ASCII 渲染(阈值=8):")
    for idx in range(10):
        img = digits.images[idx]   # (8, 8)
        label = y[idx]
        row_str = "  ".join(
            "".join("█" if px > 8 else "·" for px in row)
            for row in img
        )
        # 只打印第4行作为缩略
        mid_row = "".join("█" if px > 8 else "·" for px in img[3])
        print(f"  [{label}] {mid_row}")

    print("\n  完整渲染(数字 '3'):")
    idx_3 = np.where(y == 3)[0][0]
    for row in digits.images[idx_3]:
        print("  " + "".join("█" if px > 8 else "·" for px in row))

mode_data()

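Step 7: Use mode_train to train the 64→128→64→10 fully connected network

Pain Point and Mechanism

This section ties all the earlier building blocks together: load the data, one-hot the labels, create the MLP, train for 300 epochs, and print the loss curve and test accuracy. The loss bar chart works like a thermometer: the shorter the bar, the fewer mistakes the model makes. The accuracy tells us the final recognition quality.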
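
The bars are scaled linearly against the first epoch's loss, so once the loss has dropped by a couple of orders of magnitude the bar truncates to zero blocks. The sketch below reproduces that arithmetic with the loss values from the sample run printed at the end of the article, and then shows one possible log-scaled alternative. The log scaling and the floor value of 1e-4 are assumptions for illustration, not part of the article's code.

```python
import math

W = 35
# Loss values copied from the sample run printed later in the article.
losses = [1.1440, 0.0031, 0.0013, 0.0008, 0.0005, 0.0004]
max_loss = losses[0]

# Linear scaling, as in mode_train: small losses truncate to zero blocks.
linear = [int(v / max_loss * W) for v in losses]
print("linear bar lengths:", linear)      # [35, 0, 0, 0, 0, 0]

# Hypothetical alternative: scale by log10 between a floor and the maximum.
floor = 1e-4                               # assumed lower bound for the axis
span = math.log10(max_loss) - math.log10(floor)
log_bars = [int((math.log10(max(v, floor)) - math.log10(floor)) / span * W)
            for v in losses]
print("log bar lengths:   ", log_bars)
```

With the log version every checkpoint keeps a visible bar, which makes the slow late-stage improvement readable in a terminal.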

Core source (verbatim from the full source at the end of the article)

def mode_train() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n  训练全连接网络(64→128→64→10)\n" + "="*60)
    X_tr, X_te, y_tr, y_te = load_data()
    _test_data = (X_te, y_te)

    y_tr_oh = onehot(y_tr, 10)
    model = MLP(layer_sizes=[64, 128, 64, 10], lr=0.05)
    print(f"\n  网络: 64→128(ReLU)→64(ReLU)→10(Softmax)")
    print(f"  训练集: {len(X_tr)}  测试集: {len(X_te)}  epochs=300\n")

    history = model.fit(X_tr, y_tr_oh, epochs=300, batch_size=32)
    _trained_model = model

    # 损失曲线
    print("  训练损失曲线(每60轮)")
    checkpoints = [0, 60, 120, 180, 240, 299]
    max_loss = history[0]
    W = 35
    for ep in checkpoints:
        loss = history[ep]
        bar = "█" * int(loss / max_loss * W)
        print(f"  epoch {ep+1:>3} │{bar:<{W}}│ {loss:.4f}")

    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"\n  测试准确率: {acc:.4f}  ({acc*100:.1f}%)")

Runnable demo (with mock data and print feedback filled in)

from typing import Optional

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh

class MLP:
    """多层全连接网络,支持任意层数。"""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases:  list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # 输出层梯度
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z    = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i]  -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)

def load_data() -> tuple:
    digits = load_digits()          # 8×8 像素,10类,1797样本
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)

_trained_model: Optional[MLP] = None
_test_data: Optional[tuple] = None

def mode_train() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n  训练全连接网络(64→128→64→10)\n" + "="*60)
    X_tr, X_te, y_tr, y_te = load_data()
    _test_data = (X_te, y_te)

    y_tr_oh = onehot(y_tr, 10)
    model = MLP(layer_sizes=[64, 128, 64, 10], lr=0.05)
    print(f"\n  网络: 64→128(ReLU)→64(ReLU)→10(Softmax)")
    print(f"  训练集: {len(X_tr)}  测试集: {len(X_te)}  epochs=300\n")

    history = model.fit(X_tr, y_tr_oh, epochs=300, batch_size=32)
    _trained_model = model

    # 损失曲线
    print("  训练损失曲线(每60轮)")
    checkpoints = [0, 60, 120, 180, 240, 299]
    max_loss = history[0]
    W = 35
    for ep in checkpoints:
        loss = history[ep]
        bar = "█" * int(loss / max_loss * W)
        print(f"  epoch {ep+1:>3} │{bar:<{W}}│ {loss:.4f}")

    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"\n  测试准确率: {acc:.4f}  ({acc*100:.1f}%)")

mode_train()

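Step 8: Use mode_eval to view per-digit accuracy and the confusion matrix

Pain Point and Mechanism

Overall accuracy alone is not enough, because the model may be very good at recognizing 1 yet often mistake 9 for 8. The confusion matrix is like a table of mistakes sorted by type: rows are the true digits, columns are the predicted digits, and the larger an off-diagonal count, the more easily that pair of classes is confused.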
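
To see how the rows and columns are filled, the sketch below builds a confusion matrix by hand with NumPy instead of calling sklearn. It uses the same toy labels as this step's TinyModel demo, where the true 9 is predicted as 8, and then reads off the per-digit accuracy and the confused pairs.

```python
import numpy as np

# Same toy data as the TinyModel demo: the true 9 is predicted as 8.
y_true = np.arange(10)
y_pred = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8])

# Build the matrix by hand: row = true digit, column = predicted digit.
cm = np.zeros((10, 10), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Per-digit accuracy is the diagonal divided by the row total.
row_total = cm.sum(axis=1)
per_digit = np.diag(cm) / np.maximum(row_total, 1)
print("per-digit accuracy:", per_digit.tolist())

# List the off-diagonal cells, i.e. the actual confusions.
errors = cm.copy()
np.fill_diagonal(errors, 0)
pairs = [(int(i), int(j), int(errors[i, j])) for i, j in zip(*np.nonzero(errors))]
print("(true, predicted, count):", pairs)   # [(9, 8, 1)]
```

On a real test set the same two read-outs, the diagonal ratio and the list of non-zero off-diagonal cells, are usually enough to spot which digit pairs deserve a closer look.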

Core source (verbatim from the full source at the end of the article)

def mode_eval() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n  评估与混淆矩阵分析\n" + "="*60)

    if _trained_model is None:
        mode_train()

    X_te, y_te = _test_data
    y_pred = _trained_model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    cm = confusion_matrix(y_te, y_pred)

    # 每类准确率
    print(f"\n  整体准确率: {acc:.4f}\n")
    print(f"  {'数字':<6} {'正确':<6} {'总数':<6} {'准确率':<8} 条形图")
    print(f"  {'─'*50}")
    for digit in range(10):
        mask = y_te == digit
        correct = (y_pred[mask] == digit).sum()
        total = mask.sum()
        rate = correct / total if total > 0 else 0
        bar = "█" * int(rate * 20)
        print(f"  {digit:<6} {correct:<6} {total:<6} {rate:.3f}    {bar}")

    # ASCII 混淆矩阵(只显示错误)
    print(f"\n  混淆矩阵(行=真实,列=预测,只显示非零错误):")
    print(f"  真\\预  " + "  ".join(f"{i:2}" for i in range(10)))
    print(f"  {'─'*40}")
    for i in range(10):
        row_str = "  ".join(
            f"\033[31m{cm[i,j]:2}\033[0m" if cm[i,j] > 0 and i != j
            else f"{cm[i,j]:2}"
            for j in range(10)
        )
        print(f"  {i:2}     {row_str}")

Runnable demo (with mock data and print feedback filled in)

from typing import Optional

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

class TinyModel:
    def predict(self, X: np.ndarray) -> np.ndarray:
        # 演示用:故意让一个样本预测错,方便看到混淆矩阵里的红色错误。
        return np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 8])

_trained_model: Optional[TinyModel] = TinyModel()
_test_data: Optional[tuple] = (np.zeros((10, 64)), np.arange(10))

def mode_train() -> None:
    print("演示环境已准备 TinyModel,不需要重新训练。")

def mode_eval() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n  评估与混淆矩阵分析\n" + "="*60)

    if _trained_model is None:
        mode_train()

    X_te, y_te = _test_data
    y_pred = _trained_model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    cm = confusion_matrix(y_te, y_pred)

    # 每类准确率
    print(f"\n  整体准确率: {acc:.4f}\n")
    print(f"  {'数字':<6} {'正确':<6} {'总数':<6} {'准确率':<8} 条形图")
    print(f"  {'─'*50}")
    for digit in range(10):
        mask = y_te == digit
        correct = (y_pred[mask] == digit).sum()
        total = mask.sum()
        rate = correct / total if total > 0 else 0
        bar = "█" * int(rate * 20)
        print(f"  {digit:<6} {correct:<6} {total:<6} {rate:.3f}    {bar}")

    # ASCII 混淆矩阵(只显示错误)
    print(f"\n  混淆矩阵(行=真实,列=预测,只显示非零错误):")
    print(f"  真\\预  " + "  ".join(f"{i:2}" for i in range(10)))
    print(f"  {'─'*40}")
    for i in range(10):
        row_str = "  ".join(
            f"\033[31m{cm[i,j]:2}\033[0m" if cm[i,j] > 0 and i != j
            else f"{cm[i,j]:2}"
            for j in range(10)
        )
        print(f"  {i:2}     {row_str}")

mode_eval()

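Geek Hands-On: Full Source and Running It

Now put the building blocks together and save the complete code below as 49-python-mnist-mlp.py. It uses sklearn's built-in digits dataset, so without downloading any external images it covers data visualization, MLP training, and confusion-matrix evaluation.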

#!/usr/bin/env python3
"""
49-python-mnist-mlp.py — MNIST 风格手写数字识别

用法:
  python3 49-python-mnist-mlp.py --mode data     # 数据集可视化
  python3 49-python-mnist-mlp.py --mode train    # 训练全连接网络
  python3 49-python-mnist-mlp.py --mode eval     # 评估与混淆矩阵
  python3 49-python-mnist-mlp.py --mode all      # 全部(默认)

依赖 numpy + scikit-learn,直接运行。
使用 sklearn 内置的 digits 数据集(8×8 像素,1797 样本),
与 MNIST(28×28,70000 样本)同类问题,无需下载。
"""

import argparse
from typing import Optional

import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler



def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def relu_grad(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred: np.ndarray, true: np.ndarray) -> float:
    return -np.sum(true * np.log(pred + 1e-9)) / true.shape[0]

def onehot(y: np.ndarray, n_classes: int) -> np.ndarray:
    oh = np.zeros((len(y), n_classes))
    oh[np.arange(len(y)), y] = 1
    return oh


class MLP:
    """多层全连接网络,支持任意层数。"""

    def __init__(self, layer_sizes: list[int], lr: float = 0.01,
                 seed: int = 42) -> None:
        rng = np.random.RandomState(seed)
        self.weights: list[np.ndarray] = []
        self.biases:  list[np.ndarray] = []
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
            self.weights.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
            self.biases.append(np.zeros((1, n_out)))
        self.lr = lr
        self._cache: list[dict] = []

    def forward(self, X: np.ndarray) -> np.ndarray:
        self._cache = []
        A = X
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            Z = A @ W + b
            is_last = (i == len(self.weights) - 1)
            A_next = softmax(Z) if is_last else relu(Z)
            self._cache.append({"A_in": A, "Z": Z, "A_out": A_next})
            A = A_next
        return A

    def backward(self, y_true: np.ndarray) -> None:
        n = y_true.shape[0]
        # 输出层梯度
        dZ = (self._cache[-1]["A_out"] - y_true) / n
        for i in reversed(range(len(self.weights))):
            A_in = self._cache[i]["A_in"]
            Z    = self._cache[i]["Z"]
            dW = A_in.T @ dZ
            db = dZ.sum(axis=0, keepdims=True)
            if i > 0:
                dA = dZ @ self.weights[i].T
                dZ = dA * relu_grad(self._cache[i - 1]["Z"])
            self.weights[i] -= self.lr * dW
            self.biases[i]  -= self.lr * db

    def fit(self, X: np.ndarray, y_oh: np.ndarray,
            epochs: int = 300, batch_size: int = 64) -> list[float]:
        n = X.shape[0]
        history: list[float] = []
        for epoch in range(1, epochs + 1):
            idx = np.random.permutation(n)
            X_s, y_s = X[idx], y_oh[idx]
            loss = 0.0
            for s in range(0, n, batch_size):
                Xb, yb = X_s[s:s+batch_size], y_s[s:s+batch_size]
                pred = self.forward(Xb)
                loss += cross_entropy(pred, yb) * len(Xb)
                self.backward(yb)
            history.append(loss / n)
        return history

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.forward(X).argmax(axis=1)

# ─── 数据加载 ──────────────────────────────────────────────────────────────────

def load_data() -> tuple:
    digits = load_digits()          # 8×8 像素,10类,1797样本
    X, y = digits.data, digits.target
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return train_test_split(X, y, test_size=0.2, random_state=42)

# ─── 模式1:数据可视化 ─────────────────────────────────────────────────────────

def mode_data() -> None:
    print("\n" + "="*60 + "\n  数据集可视化(sklearn digits,8×8 像素)\n" + "="*60)
    digits = load_digits()
    X, y = digits.data, digits.target
    print(f"\n  数据集: {X.shape[0]} 样本  特征维度: {X.shape[1]}(8×8展平)")
    print(f"  类别: {sorted(set(y))}  每类约 {len(y)//10} 个样本\n")

    # ASCII 渲染前10个数字
    print("  前10个样本的 ASCII 渲染(阈值=8):")
    for idx in range(10):
        img = digits.images[idx]   # (8, 8)
        label = y[idx]
        row_str = "  ".join(
            "".join("█" if px > 8 else "·" for px in row)
            for row in img
        )
        # 只打印第4行作为缩略
        mid_row = "".join("█" if px > 8 else "·" for px in img[3])
        print(f"  [{label}] {mid_row}")

    print("\n  完整渲染(数字 '3'):")
    idx_3 = np.where(y == 3)[0][0]
    for row in digits.images[idx_3]:
        print("  " + "".join("█" if px > 8 else "·" for px in row))

# ─── 模式2:训练 ───────────────────────────────────────────────────────────────

_trained_model: Optional[MLP] = None
_test_data: Optional[tuple] = None

def mode_train() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n  训练全连接网络(64→128→64→10)\n" + "="*60)
    X_tr, X_te, y_tr, y_te = load_data()
    _test_data = (X_te, y_te)

    y_tr_oh = onehot(y_tr, 10)
    model = MLP(layer_sizes=[64, 128, 64, 10], lr=0.05)
    print(f"\n  网络: 64→128(ReLU)→64(ReLU)→10(Softmax)")
    print(f"  训练集: {len(X_tr)}  测试集: {len(X_te)}  epochs=300\n")

    history = model.fit(X_tr, y_tr_oh, epochs=300, batch_size=32)
    _trained_model = model

    # 损失曲线
    print("  训练损失曲线(每60轮)")
    checkpoints = [0, 60, 120, 180, 240, 299]
    max_loss = history[0]
    W = 35
    for ep in checkpoints:
        loss = history[ep]
        bar = "█" * int(loss / max_loss * W)
        print(f"  epoch {ep+1:>3} │{bar:<{W}}│ {loss:.4f}")

    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"\n  测试准确率: {acc:.4f}  ({acc*100:.1f}%)")

# ─── 模式3:评估与混淆矩阵 ────────────────────────────────────────────────────

def mode_eval() -> None:
    global _trained_model, _test_data
    print("\n" + "="*60 + "\n  评估与混淆矩阵分析\n" + "="*60)

    if _trained_model is None:
        mode_train()

    X_te, y_te = _test_data
    y_pred = _trained_model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    cm = confusion_matrix(y_te, y_pred)

    # 每类准确率
    print(f"\n  整体准确率: {acc:.4f}\n")
    print(f"  {'数字':<6} {'正确':<6} {'总数':<6} {'准确率':<8} 条形图")
    print(f"  {'─'*50}")
    for digit in range(10):
        mask = y_te == digit
        correct = (y_pred[mask] == digit).sum()
        total = mask.sum()
        rate = correct / total if total > 0 else 0
        bar = "█" * int(rate * 20)
        print(f"  {digit:<6} {correct:<6} {total:<6} {rate:.3f}    {bar}")

    # ASCII 混淆矩阵(只显示错误)
    print(f"\n  混淆矩阵(行=真实,列=预测,只显示非零错误):")
    print(f"  真\\预  " + "  ".join(f"{i:2}" for i in range(10)))
    print(f"  {'─'*40}")
    for i in range(10):
        row_str = "  ".join(
            f"\033[31m{cm[i,j]:2}\033[0m" if cm[i,j] > 0 and i != j
            else f"{cm[i,j]:2}"
            for j in range(10)
        )
        print(f"  {i:2}     {row_str}")

# ─── 入口 ─────────────────────────────────────────────────────────────────────

def main() -> None:
    parser = argparse.ArgumentParser(description="MNIST 风格手写数字识别")
    parser.add_argument(
        "--mode",
        choices=["data", "train", "eval", "all"],
        default="all",
    )
    args = parser.parse_args()
    dispatch = {
        "data":  mode_data,
        "train": mode_train,
        "eval":  mode_eval,
        "all":   lambda: [mode_data(), mode_train(), mode_eval()],
    }
    dispatch[args.mode]()


if __name__ == "__main__":
    main()
$ python 49-python-mnist-mlp.py --mode data
============================================================
  数据集可视化(sklearn digits,8×8 像素)
============================================================

  数据集: 1797 样本  特征维度: 64(8×8展平)
  类别: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  每类约 179 个样本

  前10个样本的 ASCII 渲染(阈值=8):
  [0] ··█·····
  [1] ··███···
  [2] ····██··
  [3] ···██···
  [4] ···█·█··
  [5] ··███···
  [6] ··██····
  [7] ····██··
  [8] ···███··
  [9] ··█·██··

  完整渲染(数字 '3'):
  ···██···
  ··█·█···
  ···██···
  ···██···
  ····██··
  ·····█··
  ·····██·
  ···███··

$ python 49-python-mnist-mlp.py --mode eval
============================================================
  评估与混淆矩阵分析
============================================================

============================================================
  训练全连接网络(64→128→64→10)
============================================================

  网络: 64→128(ReLU)→64(ReLU)→10(Softmax)
  训练集: 1437  测试集: 360  epochs=300

  训练损失曲线(每60轮)
  epoch   1 │███████████████████████████████████│ 1.1440
  epoch  61 │                                   │ 0.0031
  epoch 121 │                                   │ 0.0013
  epoch 181 │                                   │ 0.0008
  epoch 241 │                                   │ 0.0005
  epoch 300 │                                   │ 0.0004

  测试准确率: 0.9833  (98.3%)

  整体准确率: 0.9833

  数字     正确     总数     准确率      条形图
  ──────────────────────────────────────────────────
  0      33     33     1.000    ████████████████████
  1      28     28     1.000    ████████████████████
  2      33     33     1.000    ████████████████████
  3      32     34     0.941    ██████████████████
  4      46     46     1.000    ████████████████████
  5      46     47     0.979    ███████████████████
  6      34     35     0.971    ███████████████████
  7      33     34     0.971    ███████████████████
  8      30     30     1.000    ████████████████████
  9      39     40     0.975    ███████████████████

  混淆矩阵(行=真实,列=预测,只显示非零错误):
  真\预   0   1   2   3   4   5   6   7   8   9
  ────────────────────────────────────────
   0     33   0   0   0   0   0   0   0   0   0
   1      0  28   0   0   0   0   0   0   0   0
   2      0   0  33   0   0   0   0   0   0   0
   3      0   0   1  32   0   1   0   0   0   0
   4      0   0   0   0  46   0   0   0   0   0
   5      0   0   0   0   0  46   1   0   0   0
   6      0   0   0   0   0   1  34   0   0   0

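Summary and NexDo Time ⚡

In this article you completed a real image-classification loop: turn 8x8 pixels into 64-dimensional features, turn digit labels into one-hot vectors, turn the MLP output into 10 class probabilities, and then use a confusion matrix to check exactly where the model goes wrong.

5-minute micro-challenge: change MLP(layer_sizes=[64, 128, 64, 10], lr=0.05) to [64, 32, 10] and then to [64, 256, 128, 10], run --mode eval for each, and compare accuracy and training time.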
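
Before timing anything, it can help to estimate how much bigger or smaller each variant is. The sketch below counts the trainable parameters (weights plus biases) for the three layer configurations in the challenge; count_params is a hypothetical helper written for this illustration, not part of the article's script.

```python
def count_params(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for every layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

configs = [[64, 32, 10], [64, 128, 64, 10], [64, 256, 128, 10]]
counts = {tuple(c): count_params(c) for c in configs}
for cfg, total in counts.items():
    print(cfg, "->", total, "parameters")
```

The parameter count is only a rough proxy for training time, so the measured runtime may not scale by exactly the same factor.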

Don’t wait for next time, do it in the next moment.