
43 · Text Mining: A First Look at TF-IDF and NLP

#052 · 2026-04-17 · Python

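🔗 Knowledge graph navigation: before reading this article, review the high-dimensional vector representations and feature standardization covered in "42 · Unsupervised Learning: K-Means Clustering and Anomaly Detection". This article moves from low-dimensional coordinate spaces to the sparse feature space of unstructured text and introduces the basic tools of natural language processing.

Bridging from the previous article: in "42 · Unsupervised Learning: K-Means Clustering and Anomaly Detection" we clustered geometric coordinate points and detected outliers. A large share of real-world data, however, is unstructured text. How do we turn that text into feature vectors a model can work with? Starting from term frequency and inverse document frequency, this article works through the TF-IDF (term frequency-inverse document frequency) algorithm. We write a TF-IDF extractor by hand, compare naive Bayes with two other classifiers on text classification, and use cosine similarity to build a simple search engine.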

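Runtime environment: pip install numpy scikit-learn. This article uses a built-in mock corpus of short English texts; it downloads no datasets and calls no external APIs.

Pain point and architecture: a computer cannot understand a sentence directly; it only processes numbers. The first step in NLP is to split text into words and then turn those words into vectors. This article uses TF-IDF to extract keywords, text classifiers to assign categories, and cosine similarity to find similar documents.

Build the NLP intuition first

Raw text -> tokenization -> vocabulary -> numeric vectors -> classification / search / similarity

Geek's take: vectorizing text is like turning an article into a recipe card. Each word is an ingredient and its TF-IDF score is the amount. The more alike two recipes are, the closer their topics usually are.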
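As a minimal sketch of the first three arrows in the pipeline (text, tokens, vocabulary, count vectors), the snippet below uses only the standard library. It assumes clean, lowercase, whitespace-separated English text, which is what the mock corpus in this article looks like; the two sample sentences are illustrative.

```python
from collections import Counter

docs = ["python data science python", "finance market risk"]

# Tokenize by whitespace (assumes clean, lowercase English text).
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of every word seen in the corpus.
vocab = sorted({word for words in tokenized for word in words})

# Each document becomes one row of counts, one column per vocabulary word.
vectors = []
for words in tokenized:
    counts = Counter(words)
    vectors.append([counts[word] for word in vocab])

print(vocab)
for row in vectors:
    print(row)
```

Raw counts like these are the "amounts" before any weighting; TF-IDF replaces them with scores that also account for how common each word is across documents.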

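Step by step: breaking down the core logic

This article is split into 6 steps: first map out NLP with a table, then write TF-IDF by hand, then do keyword extraction, text classification, and document similarity, and finally wire everything into a runnable script with argparse.

Step 1: Use print_table to lay out the NLP pipeline as a small table

Pain point and mechanism:

print_table is a layout helper for the terminal. For beginners, the hardest part of NLP is not the code but the number of concepts: documents, vocabulary, vectors, weights. Putting them in a table first is like labeling the ingredients in a kitchen, so nothing gets mixed up once the cooking starts.

Core source (verbatim from the full source at the end of the article):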

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

Runnable demo (with mock data and print feedback added):

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

print_table(
    ["NLP对象", "小白理解", "后面怎么用"],
    [
        ["文档", "一篇短文本", "先切成词"],
        ["词表", "所有出现过的词", "变成向量列"],
        ["TF-IDF", "词的重要性分数", "提关键词/做分类/算相似"],
    ],
    "文本挖掘三件套",
)
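One detail worth knowing: print_table pads cells using len(), which counts every character as one column, while most terminals draw CJK characters two columns wide, so tables with Chinese headers can look misaligned. The sketch below is one possible workaround and is not part of the article's script. display_width and pad are hypothetical helper names, and the approach assumes the terminal renders East Asian wide and fullwidth characters at double width.

```python
import unicodedata


def display_width(text: str) -> int:
    # Wide ("W") and fullwidth ("F") characters usually take two terminal columns.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def pad(text: str, width: int) -> str:
    # Left-align text to a target display width instead of a character count.
    return text + " " * max(0, width - display_width(text))


for cell in ["文档", "tech", "Top3关键词"]:
    print(f"|{pad(cell, 12)}| len={len(cell)} display={display_width(cell)}")
```

Swapping len() for a width function like this inside print_table would be a small change, but it is left as an exercise because the article's source is kept verbatim.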

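Step 2: Use TFIDFFromScratch to hand-write a word importance scorer

Pain point and mechanism:

TF-IDF works like scoring a restaurant's signature dishes: if a word appears often in the current document, its TF is high; if the word appears in nearly every document, its IDF pulls the score down. The top-scoring words tend to be both frequent and distinctive, which makes them good keywords.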
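To make the scoring concrete, here is a small worked calculation that follows the same formula as the TFIDFFromScratch class in this step: tf is the word count divided by the document length, and idf(w) = ln(N / (df(w) + 1)) + 1, where N is the number of documents and df(w) is the number of documents containing w. The helper names tf and idf are illustrative, not part of the article's source.

```python
import math

docs = [
    "python data science python",
    "finance market risk",
    "python machine learning data",
]
n = len(docs)


def idf(word: str) -> float:
    # Document frequency: number of documents that contain the word.
    df = sum(word in doc.split() for doc in docs)
    return math.log(n / (df + 1)) + 1


def tf(word: str, doc: str) -> float:
    # Term frequency: share of the document's words that are this word.
    words = doc.split()
    return words.count(word) / len(words)


for word in ["python", "science", "data"]:
    score = tf(word, docs[0]) * idf(word)
    print(f"{word:<8} tf={tf(word, docs[0]):.2f} idf={idf(word):.4f} tfidf={score:.4f}")
```

In the first document, "python" wins on frequency (it fills half the document), while "science" beats "data" because it appears in only one document and so gets a higher IDF. Note that with this smoothing a word present in every document gets an IDF slightly below 1 rather than 0, so it is down-weighted but never removed.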

Core source (verbatim from the full source at the end of the article):

class TFIDFFromScratch:
    def __init__(self):
        self.idf_: Dict[str, float] = {}
        self.vocab_: List[str] = []

    def fit(self, docs: List[str]) -> "TFIDFFromScratch":
        n = len(docs)
        df: Counter = Counter()
        for doc in docs:
            for word in set(doc.split()):
                df[word] += 1
        self.idf_ = {w: math.log(n / (cnt + 1)) + 1 for w, cnt in df.items()}
        self.vocab_ = sorted(self.idf_.keys())
        return self

    def transform(self, docs: List[str]) -> np.ndarray:
        matrix = np.zeros((len(docs), len(self.vocab_)))
        for i, doc in enumerate(docs):
            words = doc.split()
            tf = Counter(words)
            total = len(words)
            for j, word in enumerate(self.vocab_):
                if word in tf:
                    matrix[i, j] = (tf[word] / total) * self.idf_[word]
        return matrix

    def top_keywords(self, doc: str, n: int = 5) -> List[Tuple[str, float]]:
        words = doc.split()
        tf = Counter(words)
        total = len(words)
        scores = {w: (tf[w] / total) * self.idf_.get(w, 0) for w in set(words)}
        return sorted(scores.items(), key=lambda x: -x[1])[:n]

Runnable demo (with mock data and print feedback added):

import math
from collections import Counter
from typing import Dict, List, Tuple
import numpy as np

class TFIDFFromScratch:
    def __init__(self):
        self.idf_: Dict[str, float] = {}
        self.vocab_: List[str] = []

    def fit(self, docs: List[str]) -> "TFIDFFromScratch":
        n = len(docs)
        df: Counter = Counter()
        for doc in docs:
            for word in set(doc.split()):
                df[word] += 1
        self.idf_ = {w: math.log(n / (cnt + 1)) + 1 for w, cnt in df.items()}
        self.vocab_ = sorted(self.idf_.keys())
        return self

    def transform(self, docs: List[str]) -> np.ndarray:
        matrix = np.zeros((len(docs), len(self.vocab_)))
        for i, doc in enumerate(docs):
            words = doc.split()
            tf = Counter(words)
            total = len(words)
            for j, word in enumerate(self.vocab_):
                if word in tf:
                    matrix[i, j] = (tf[word] / total) * self.idf_[word]
        return matrix

    def top_keywords(self, doc: str, n: int = 5) -> List[Tuple[str, float]]:
        words = doc.split()
        tf = Counter(words)
        total = len(words)
        scores = {w: (tf[w] / total) * self.idf_.get(w, 0) for w in set(words)}
        return sorted(scores.items(), key=lambda x: -x[1])[:n]

docs = [
    "python data science python",
    "finance market risk",
    "python machine learning data",
]
tfidf = TFIDFFromScratch().fit(docs)
matrix = tfidf.transform(docs)
print("词表:", tfidf.vocab_)
print("矩阵形状:", matrix.shape)
print("第一篇 Top 关键词:")
for word, score in tfidf.top_keywords(docs[0], n=3):
    print(f"  {word:<10} {score:.4f}")

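Step 3: Use mode_tfidf to extract the top keywords of each document

Pain point and mechanism:

mode_tfidf applies the hand-written TF-IDF to the small mock corpus and prints the words that best represent each document's topic. It works like automatic tagging: a tech document might be tagged learning/model, a finance document market/risk.

Core source (verbatim from the full source at the end of the article):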

def mode_tfidf() -> None:
    print(f"[{nexdo_time()}] 从零实现 TF-IDF")
    tfidf = TFIDFFromScratch()
    tfidf.fit(DOCS)

    rows = []
    for i, (label, doc) in enumerate(CORPUS[:5]):
        keywords = tfidf.top_keywords(doc, n=3)
        kw_str = " | ".join(f"{w}({s:.3f})" for w, s in keywords)
        rows.append([i, label, kw_str])
    print_table(["文档", "类别", "Top3关键词(TF-IDF分数)"], rows, "TF-IDF 关键词提取（前5篇）")

Runnable demo (with mock data and print feedback added):

import math
import time
from collections import Counter
from typing import Dict, List, Tuple
import numpy as np


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

CORPUS = [
    ("tech", "python machine learning model training"),
    ("tech", "neural network deep learning model"),
    ("tech", "data science algorithm optimization"),
    ("finance", "stock market investment risk"),
    ("finance", "bank interest rate inflation"),
    ("finance", "startup funding valuation equity"),
    ("health", "exercise nutrition healthy lifestyle"),
    ("health", "sleep stress mental health"),
    ("health", "heart disease blood pressure"),
]
LABELS = [label for label, _ in CORPUS]
DOCS = [doc for _, doc in CORPUS]

class TFIDFFromScratch:
    def __init__(self):
        self.idf_: Dict[str, float] = {}
        self.vocab_: List[str] = []

    def fit(self, docs: List[str]) -> "TFIDFFromScratch":
        n = len(docs)
        df: Counter = Counter()
        for doc in docs:
            for word in set(doc.split()):
                df[word] += 1
        self.idf_ = {w: math.log(n / (cnt + 1)) + 1 for w, cnt in df.items()}
        self.vocab_ = sorted(self.idf_.keys())
        return self

    def transform(self, docs: List[str]) -> np.ndarray:
        matrix = np.zeros((len(docs), len(self.vocab_)))
        for i, doc in enumerate(docs):
            words = doc.split()
            tf = Counter(words)
            total = len(words)
            for j, word in enumerate(self.vocab_):
                if word in tf:
                    matrix[i, j] = (tf[word] / total) * self.idf_[word]
        return matrix

    def top_keywords(self, doc: str, n: int = 5) -> List[Tuple[str, float]]:
        words = doc.split()
        tf = Counter(words)
        total = len(words)
        scores = {w: (tf[w] / total) * self.idf_.get(w, 0) for w in set(words)}
        return sorted(scores.items(), key=lambda x: -x[1])[:n]

def mode_tfidf() -> None:
    print(f"[{nexdo_time()}] 从零实现 TF-IDF")
    tfidf = TFIDFFromScratch()
    tfidf.fit(DOCS)

    rows = []
    for i, (label, doc) in enumerate(CORPUS[:5]):
        keywords = tfidf.top_keywords(doc, n=3)
        kw_str = " | ".join(f"{w}({s:.3f})" for w, s in keywords)
        rows.append([i, label, kw_str])
    print_table(["文档", "类别", "Top3关键词(TF-IDF分数)"], rows, "TF-IDF 关键词提取（前5篇）")

mode_tfidf()

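Step 4: Use mode_classify to compare three text classifiers

Pain point and mechanism:

Text classification first turns sentences into numeric vectors and then hands those vectors to a classifier. Naive Bayes votes according to how likely each word is under each class, logistic regression learns a weight for every word, and a linear SVM looks for the separating hyperplane with the largest margin.

Core source (verbatim from the full source at the end of the article):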

def mode_classify() -> None:
    print(f"[{nexdo_time()}] sklearn 文本分类器对比")
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=500)
    X = vectorizer.fit_transform(DOCS)
    y = np.array(LABELS)

    rows = []
    for name, clf in [
        ("朴素贝叶斯",   MultinomialNB()),
        ("逻辑回归",     LogisticRegression(max_iter=500, random_state=42)),
        ("线性SVM",      LinearSVC(max_iter=500, random_state=42)),
    ]:
        scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
        rows.append([name, f"{scores.mean():.4f}", f"{scores.std():.4f}"])
    print_table(["分类器", "CV均值准确率", "CV标准差"], rows, "文本分类器对比（3折CV）")

Runnable demo (with mock data and print feedback added):

import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

CORPUS = [
    ("tech", "python machine learning model training"),
    ("tech", "neural network deep learning model"),
    ("tech", "data science algorithm optimization"),
    ("finance", "stock market investment risk"),
    ("finance", "bank interest rate inflation"),
    ("finance", "startup funding valuation equity"),
    ("health", "exercise nutrition healthy lifestyle"),
    ("health", "sleep stress mental health"),
    ("health", "heart disease blood pressure"),
]
LABELS = [label for label, _ in CORPUS]
DOCS = [doc for _, doc in CORPUS]

def mode_classify() -> None:
    print(f"[{nexdo_time()}] sklearn 文本分类器对比")
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=500)
    X = vectorizer.fit_transform(DOCS)
    y = np.array(LABELS)

    rows = []
    for name, clf in [
        ("朴素贝叶斯",   MultinomialNB()),
        ("逻辑回归",     LogisticRegression(max_iter=500, random_state=42)),
        ("线性SVM",      LinearSVC(max_iter=500, random_state=42)),
    ]:
        scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
        rows.append([name, f"{scores.mean():.4f}", f"{scores.std():.4f}"])
    print_table(["分类器", "CV均值准确率", "CV标准差"], rows, "文本分类器对比（3折CV）")

mode_classify()
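One caveat about the function above: the vectorizer is fitted on all documents before cross-validation, so the vocabulary and IDF statistics have already seen the text of every held-out fold. A common alternative, sketched below, wraps the vectorizer and the classifier in a single pipeline so the vectorizer is refitted inside each training fold. This is a sketch under the same 9-document mock corpus, not the article's method; the query string "bank stock risk" is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

CORPUS = [
    ("tech", "python machine learning model training"),
    ("tech", "neural network deep learning model"),
    ("tech", "data science algorithm optimization"),
    ("finance", "stock market investment risk"),
    ("finance", "bank interest rate inflation"),
    ("finance", "startup funding valuation equity"),
    ("health", "exercise nutrition healthy lifestyle"),
    ("health", "sleep stress mental health"),
    ("health", "heart disease blood pressure"),
]
docs = [doc for _, doc in CORPUS]
labels = np.array([label for label, _ in CORPUS])

# The pipeline takes raw strings, so each CV fold fits its own vectorizer.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
scores = cross_val_score(pipeline, docs, labels, cv=3, scoring="accuracy")
print("fold accuracies:", scores)

# After fitting on the full corpus, the same object classifies new text.
pipeline.fit(docs, labels)
print(pipeline.predict(["bank stock risk"]))
```

With only three documents per class, fold accuracies on a corpus this small are noisy either way, so treat the numbers as a demonstration of the workflow rather than a meaningful benchmark.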

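Step 5: Use mode_similarity to find the documents closest to a query

Pain point and mechanism:

Cosine similarity compares the directions of two arrows: differences in document length do not matter, and as long as the topic words point in a similar direction the similarity is high. Search, recommendation, and duplicate detection all rely on this idea.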
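The "direction, not length" idea can be checked with a few lines of NumPy. The sketch below implements cosine similarity directly from its definition, the dot product divided by the product of the two vector lengths; the function name cosine and the toy vectors are illustrative.

```python
import numpy as np


def cosine(a, b) -> float:
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Treat an all-zero vector as having no similarity to anything.
    return float(a @ b / denom) if denom else 0.0


short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc            # same word mix, ten times longer
other_doc = np.array([0.0, 0.0, 3.0])  # shares no words with short_doc

print(f"same direction: {cosine(short_doc, long_doc):.4f}")
print(f"no shared words: {cosine(short_doc, other_doc):.4f}")
```

Scaling a vector changes its length but not its direction, so the first pair scores as identical, while two documents with no words in common are orthogonal and score zero. Because TF-IDF weights are never negative, similarities between TF-IDF vectors always fall between 0 and 1.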

Core source (verbatim from the full source at the end of the article):

def mode_similarity() -> None:
    print(f"[{nexdo_time()}] 文档相似度计算")
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(DOCS).toarray()

    # 计算前5篇文档两两余弦相似度
    sim_matrix = cosine_similarity(X[:5])
    labels_short = [f"D{i}({LABELS[i][:3]})" for i in range(5)]

    print(f"\n  余弦相似度矩阵（前5篇文档）")
    header = f"  {'':12s}" + "".join(f"{l:>12s}" for l in labels_short)
    print(header)
    for i, row_label in enumerate(labels_short):
        line = f"  {row_label:12s}"
        for j in range(5):
            val = sim_matrix[i][j]
            line += f"{val:>12.4f}"
        print(line)

    # 查询文档最相似
    query = "deep learning neural network training optimization"
    q_vec = vectorizer.transform([query]).toarray()
    sims = cosine_similarity(q_vec, X)[0]
    top3 = np.argsort(sims)[::-1][:3]
    rows = [(i+1, LABELS[top3[i]], DOCS[top3[i]][:40]+"...", f"{sims[top3[i]]:.4f}")
            for i in range(3)]
    print_table(["排名", "类别", "文档片段", "相似度"], rows,
                f"查询: '{query[:30]}...' 最相似文档")

Runnable demo (with mock data and print feedback added):

import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")

def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")

CORPUS = [
    ("tech", "python machine learning model training"),
    ("tech", "neural network deep learning model"),
    ("tech", "data science algorithm optimization"),
    ("finance", "stock market investment risk"),
    ("finance", "bank interest rate inflation"),
    ("finance", "startup funding valuation equity"),
    ("health", "exercise nutrition healthy lifestyle"),
    ("health", "sleep stress mental health"),
    ("health", "heart disease blood pressure"),
]
LABELS = [label for label, _ in CORPUS]
DOCS = [doc for _, doc in CORPUS]

def mode_similarity() -> None:
    print(f"[{nexdo_time()}] 文档相似度计算")
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(DOCS).toarray()

    # 计算前5篇文档两两余弦相似度
    sim_matrix = cosine_similarity(X[:5])
    labels_short = [f"D{i}({LABELS[i][:3]})" for i in range(5)]

    print(f"\n  余弦相似度矩阵（前5篇文档）")
    header = f"  {'':12s}" + "".join(f"{l:>12s}" for l in labels_short)
    print(header)
    for i, row_label in enumerate(labels_short):
        line = f"  {row_label:12s}"
        for j in range(5):
            val = sim_matrix[i][j]
            line += f"{val:>12.4f}"
        print(line)

    # 查询文档最相似
    query = "deep learning neural network training optimization"
    q_vec = vectorizer.transform([query]).toarray()
    sims = cosine_similarity(q_vec, X)[0]
    top3 = np.argsort(sims)[::-1][:3]
    rows = [(i+1, LABELS[top3[i]], DOCS[top3[i]][:40]+"...", f"{sims[top3[i]]:.4f}")
            for i in range(3)]
    print_table(["排名", "类别", "文档片段", "相似度"], rows,
                f"查询: '{query[:30]}...' 最相似文档")

mode_similarity()
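The query part of mode_similarity can also be packaged as a small reusable function. The sketch below is one possible shape for it, not the article's code: search is a hypothetical helper, the query "stock market crash" is made up, and the function keeps the TF-IDF matrix sparse (cosine_similarity accepts sparse input) and drops results whose similarity is zero.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DOCS = [
    "python machine learning model training",
    "neural network deep learning model",
    "data science algorithm optimization",
    "stock market investment risk",
    "bank interest rate inflation",
    "startup funding valuation equity",
    "exercise nutrition healthy lifestyle",
    "sleep stress mental health",
    "heart disease blood pressure",
]


def search(query: str, docs: list, k: int = 3) -> list:
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)   # stays sparse
    query_vec = vectorizer.transform([query])     # unseen words are ignored
    sims = cosine_similarity(query_vec, doc_matrix)[0]
    order = np.argsort(sims)[::-1][:k]
    # Keep only documents that share at least one vocabulary word with the query.
    return [(docs[i], float(sims[i])) for i in order if sims[i] > 0]


results = search("stock market crash", DOCS)
for doc, score in results:
    print(f"{score:.4f}  {doc}")
```

Filtering out zero scores avoids presenting unrelated documents as "top 3" matches when the query overlaps with only one or two documents, which can easily happen on a corpus this small.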

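Step 6: Use main to dispatch the tfidf/classify/similarity/all commands

Pain point and mechanism:

main is the command-line remote control. Readers do not need to edit the source: switching --mode lets them practice keyword extraction, text classification, or document similarity separately, or run the whole flow at once with all.

Core source (verbatim from the full source at the end of the article):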

def main() -> None:
    parser = argparse.ArgumentParser(description="TF-IDF 与 NLP 文本挖掘演示")
    parser.add_argument("--mode", choices=["tfidf", "classify", "similarity", "all"],
                        default="all")
    args = parser.parse_args()
    dispatch = {
        "tfidf":      mode_tfidf,
        "classify":   mode_classify,
        "similarity": mode_similarity,
        "all":        lambda: [mode_tfidf(), mode_classify(), mode_similarity()],
    }
    dispatch[args.mode]()
    print(f"\n[{nexdo_time()}] 完成")

Runnable demo (with mock data and print feedback added):

import argparse
import sys


def mode_tfidf() -> None:
    print("tfidf：把文本拆成词，并计算每个词的重要性")


def mode_classify() -> None:
    print("classify：用 TF-IDF 向量训练文本分类器")


def mode_similarity() -> None:
    print("similarity：把文章变成向量后计算相似度")


def nexdo_time() -> str:
    return "2026-04-18 11:05:00"

def main() -> None:
    parser = argparse.ArgumentParser(description="TF-IDF 与 NLP 文本挖掘演示")
    parser.add_argument("--mode", choices=["tfidf", "classify", "similarity", "all"],
                        default="all")
    args = parser.parse_args()
    dispatch = {
        "tfidf":      mode_tfidf,
        "classify":   mode_classify,
        "similarity": mode_similarity,
        "all":        lambda: [mode_tfidf(), mode_classify(), mode_similarity()],
    }
    dispatch[args.mode]()
    print(f"\n[{nexdo_time()}] 完成")

for mode in ["tfidf", "classify", "similarity", "all"]:
    print(f"\n$ python3 43-nlp-tfidf.py --mode {mode}")
    sys.argv = ["prog", "--mode", mode]
    main()
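The demo above simulates command lines by overwriting sys.argv. An alternative that avoids touching global state is to pass the argument list straight to parse_args, which falls back to sys.argv[1:] only when it receives None. The sketch below shows that pattern; build_parser and run are hypothetical names, and run simply returns the selected mode instead of calling the real mode functions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="TF-IDF and NLP text mining demo")
    parser.add_argument("--mode", choices=["tfidf", "classify", "similarity", "all"],
                        default="all")
    return parser


def run(argv: Optional[List[str]] = None) -> str:
    # With argv=None argparse reads sys.argv[1:]; a list is parsed as given.
    args = build_parser().parse_args(argv)
    return args.mode


for mode in ["tfidf", "classify", "similarity", "all"]:
    print(f"--mode {mode} ->", run(["--mode", mode]))

print("no arguments ->", run([]))
```

Accepting an optional argument list makes the entry point easy to call from tests or notebooks while behaving the same as before when launched from a shell.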

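Hands-on: full source and how to run it

Now assemble the building blocks above by pasting the full code below into your editor. Run --mode tfidf first to see the keywords, then --mode classify and --mode similarity to understand classification and search.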

#!/usr/bin/env python3
"""
43-nlp-tfidf.py
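TF-IDF from scratch + sklearn text classifiers + keyword extraction + document similarity
No dataset downloads or external APIs (dependencies: numpy and scikit-learn, plus the standard library)

Usage: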
  python 43-nlp-tfidf.py --mode tfidf
  python 43-nlp-tfidf.py --mode classify
  python 43-nlp-tfidf.py --mode similarity
  python 43-nlp-tfidf.py --mode all
"""

import argparse
import math
import time
from collections import Counter
from typing import List, Dict, Tuple

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import cosine_similarity


def nexdo_time() -> str:
    return time.strftime("%Y-%m-%d %H:%M:%S")


def print_table(headers: list, rows: list, title: str = "") -> None:
    if title:
        print(f"\n{'='*65}\n  {title}\n{'='*65}")
    col_widths = [max(len(str(h)), max((len(str(r[i])) for r in rows), default=0))
                  for i, h in enumerate(headers)]
    print(f"┌{'┬'.join('─'*(w+2) for w in col_widths)}┐")
    print(f"│{'│'.join(f' {str(h):<{w}} ' for h, w in zip(headers, col_widths))}│")
    print(f"├{'┼'.join('─'*(w+2) for w in col_widths)}┤")
    for row in rows:
        print(f"│{'│'.join(f' {str(v):<{w}} ' for v, w in zip(row, col_widths))}│")
    print(f"└{'┴'.join('─'*(w+2) for w in col_widths)}┘")








CORPUS = [
    ("tech",    "machine learning model training neural network deep learning"),
    ("tech",    "python programming data science algorithm optimization"),
    ("tech",    "artificial intelligence natural language processing transformer"),
    ("tech",    "computer vision image recognition convolutional neural network"),
    ("tech",    "reinforcement learning reward policy gradient agent"),
    ("finance", "stock market investment portfolio risk management"),
    ("finance", "cryptocurrency bitcoin blockchain decentralized finance"),
    ("finance", "interest rate inflation monetary policy central bank"),
    ("finance", "earnings revenue profit loss quarterly report"),
    ("finance", "venture capital startup funding valuation equity"),
    ("health",  "exercise fitness nutrition diet healthy lifestyle"),
    ("health",  "mental health stress anxiety depression therapy"),
    ("health",  "vaccine immunity virus pandemic public health"),
    ("health",  "sleep quality rest recovery performance wellness"),
    ("health",  "heart disease blood pressure cholesterol prevention"),
]

LABELS = [label for label, _ in CORPUS]
DOCS   = [doc   for _, doc   in CORPUS]


# ── 从零实现 TF-IDF ──────────────────────────────────────────────────────────

class TFIDFFromScratch:
    def __init__(self):
        self.idf_: Dict[str, float] = {}
        self.vocab_: List[str] = []

    def fit(self, docs: List[str]) -> "TFIDFFromScratch":
        n = len(docs)
        df: Counter = Counter()
        for doc in docs:
            for word in set(doc.split()):
                df[word] += 1
        self.idf_ = {w: math.log(n / (cnt + 1)) + 1 for w, cnt in df.items()}
        self.vocab_ = sorted(self.idf_.keys())
        return self

    def transform(self, docs: List[str]) -> np.ndarray:
        matrix = np.zeros((len(docs), len(self.vocab_)))
        for i, doc in enumerate(docs):
            words = doc.split()
            tf = Counter(words)
            total = len(words)
            for j, word in enumerate(self.vocab_):
                if word in tf:
                    matrix[i, j] = (tf[word] / total) * self.idf_[word]
        return matrix

    def top_keywords(self, doc: str, n: int = 5) -> List[Tuple[str, float]]:
        words = doc.split()
        tf = Counter(words)
        total = len(words)
        scores = {w: (tf[w] / total) * self.idf_.get(w, 0) for w in set(words)}
        return sorted(scores.items(), key=lambda x: -x[1])[:n]


def mode_tfidf() -> None:
    print(f"[{nexdo_time()}] 从零实现 TF-IDF")
    tfidf = TFIDFFromScratch()
    tfidf.fit(DOCS)

    rows = []
    for i, (label, doc) in enumerate(CORPUS[:5]):
        keywords = tfidf.top_keywords(doc, n=3)
        kw_str = " | ".join(f"{w}({s:.3f})" for w, s in keywords)
        rows.append([i, label, kw_str])
    print_table(["文档", "类别", "Top3关键词(TF-IDF分数)"], rows, "TF-IDF 关键词提取（前5篇）")


def mode_classify() -> None:
    print(f"[{nexdo_time()}] sklearn 文本分类器对比")
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=500)
    X = vectorizer.fit_transform(DOCS)
    y = np.array(LABELS)

    rows = []
    for name, clf in [
        ("朴素贝叶斯",   MultinomialNB()),
        ("逻辑回归",     LogisticRegression(max_iter=500, random_state=42)),
        ("线性SVM",      LinearSVC(max_iter=500, random_state=42)),
    ]:
        scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
        rows.append([name, f"{scores.mean():.4f}", f"{scores.std():.4f}"])
    print_table(["分类器", "CV均值准确率", "CV标准差"], rows, "文本分类器对比（3折CV）")


def mode_similarity() -> None:
    print(f"[{nexdo_time()}] 文档相似度计算")
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(DOCS).toarray()

    # 计算前5篇文档两两余弦相似度
    sim_matrix = cosine_similarity(X[:5])
    labels_short = [f"D{i}({LABELS[i][:3]})" for i in range(5)]

    print(f"\n  余弦相似度矩阵（前5篇文档）")
    header = f"  {'':12s}" + "".join(f"{l:>12s}" for l in labels_short)
    print(header)
    for i, row_label in enumerate(labels_short):
        line = f"  {row_label:12s}"
        for j in range(5):
            val = sim_matrix[i][j]
            line += f"{val:>12.4f}"
        print(line)

    # 查询文档最相似
    query = "deep learning neural network training optimization"
    q_vec = vectorizer.transform([query]).toarray()
    sims = cosine_similarity(q_vec, X)[0]
    top3 = np.argsort(sims)[::-1][:3]
    rows = [(i+1, LABELS[top3[i]], DOCS[top3[i]][:40]+"...", f"{sims[top3[i]]:.4f}")
            for i in range(3)]
    print_table(["排名", "类别", "文档片段", "相似度"], rows,
                f"查询: '{query[:30]}...' 最相似文档")


def main() -> None:
    parser = argparse.ArgumentParser(description="TF-IDF 与 NLP 文本挖掘演示")
    parser.add_argument("--mode", choices=["tfidf", "classify", "similarity", "all"],
                        default="all")
    args = parser.parse_args()
    dispatch = {
        "tfidf":      mode_tfidf,
        "classify":   mode_classify,
        "similarity": mode_similarity,
        "all":        lambda: [mode_tfidf(), mode_classify(), mode_similarity()],
    }
    dispatch[args.mode]()
    print(f"\n[{nexdo_time()}] 完成")


if __name__ == "__main__":
    main()

$ python3 43-nlp-tfidf.py --mode tfidf
[2026-04-18 11:06:37] 从零实现 TF-IDF

=================================================================
  TF-IDF 关键词提取（前5篇）
=================================================================
┌────┬──────┬─────────────────────────────────────────────────────────────┐
│ 文档 │ 类别   │ Top3关键词(TF-IDF分数)                                           │
├────┼──────┼─────────────────────────────────────────────────────────────┤
│ 0  │ tech │ learning(0.652) | training(0.377) | machine(0.377)          │
│ 1  │ tech │ programming(0.502) | optimization(0.502) | algorithm(0.502) │
│ 2  │ tech │ artificial(0.502) | natural(0.502) | intelligence(0.502)    │
│ 3  │ tech │ image(0.431) | convolutional(0.431) | recognition(0.431)    │
│ 4  │ tech │ gradient(0.502) | reward(0.502) | agent(0.502)              │
└────┴──────┴─────────────────────────────────────────────────────────────┘

[2026-04-18 11:06:37] 完成

$ python3 43-nlp-tfidf.py --mode similarity
[2026-04-18 11:06:38] 文档相似度计算

  余弦相似度矩阵（前5篇文档）
                   D0(tec)     D1(tec)     D2(tec)     D3(tec)     D4(tec)
  D0(tec)           1.0000      0.0000      0.0000      0.2025      0.2201
  D1(tec)           0.0000      1.0000      0.0000      0.0000      0.0000
  D2(tec)           0.0000      0.0000      1.0000      0.0000      0.0000
  D3(tec)           0.2025      0.0000      0.0000      1.0000      0.0000
  D4(tec)           0.2201      0.0000      0.0000      0.0000      1.0000

=================================================================
  查询: 'deep learning neural network t...' 最相似文档
=================================================================
┌────┬──────┬─────────────────────────────────────────────┬────────┐
│ 排名 │ 类别   │ 文档片段                                        │ 相似度    │
├────┼──────┼─────────────────────────────────────────────┼────────┤
│ 1  │ tech │ machine learning model training neural n... │ 0.7490 │
│ 2  │ tech │ computer vision image recognition convol... │ 0.2577 │
│ 3  │ tech │ python programming data science algorith... │ 0.1780 │
└────┴──────┴─────────────────────────────────────────────┴────────┘

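Summary

Module	What to remember
`TFIDFFromScratch`	Scores keywords by hand from term frequency and inverse document frequency
`mode_tfidf`	Extracts the words that best represent each document's topic
`mode_classify`	Feeds text vectors to classifiers for topic identification
`mode_similarity`	Uses cosine similarity for search and similar-document lookup
`main`	Splits the NLP experiments into separate entry points via `--mode`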

⏱ NexDo Time (5 minutes)

Challenge: add 2 short sports texts to CORPUS, run --mode classify again, and check whether the 3-fold cross-validation scores stay stable.

Don’t wait for next time, do it in the next moment.

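💡 Next up: with vectorized text representations, classification, and cosine-similarity retrieval in hand, you have opened the door to analyzing unstructured natural language. Beyond text, there is another unstructured medium that is more visual and carries a great deal of information: images. In the next article, "44 · Visual Perception: OpenCV Image Processing and Face Extraction", we switch to computer vision and learn how to use Python and OpenCV for pixel operations and edge extraction, and how to call a Haar cascade classifier for efficient local face detection.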