3 · The Logic Engine: Conditionals, Loops, and Comprehensions

#003 · 2026-04-16 · Python

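🔗 Knowledge graph navigation: before reading this article, it helps to learn or review the core concepts in "2 · Data Flow: Variables, Types, and a First Look at Memory"; this article builds on them.

Bridging the two: in the previous article we saw how data moves through memory. Now we add "logic" to that data, so the program can make decisions, loop, and process items in batches.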

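Geek's take: get the data flow, control flow, and module boundaries working first, then talk about abstraction; every code segment here is built around one runnable CLI loop.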

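Pain Points and Architecture

Comprehensions are among Python's most recognizable features, yet many people only ever write the simplest form. This article uses a knowledge-base text batch-processing pipeline to show what comprehensions can do in practice: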

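Raw text list
    ↓ list comprehension (filter + transform)
Cleaned text
    ↓ dictionary building (inverted index + word counts)
Word index and frequency table
    ↓ generator (lazy processing for large files)
Results emitted batch by batch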
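
Before the step-by-step breakdown, the three stages in the diagram can be sketched in a few lines. This is a minimal preview, not the article's final script: the sample list `raw` and the helper name `emit` are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterator

# Tiny made-up sample, standing in for the real corpus used later.
raw = ["  Python is fun.  ", "", "PYTHON is readable.", "   "]

# Stage 1: list comprehension (filter out blank lines, then strip + lowercase).
cleaned = [line.strip().lower() for line in raw if line.strip()]

# Stage 2: a dictionary of word counts built from the cleaned lines.
counts = dict(Counter(w for line in cleaned for w in re.findall(r"[a-z]+", line)))

# Stage 3: a generator that hands out one line at a time.
def emit(lines: list[str]) -> Iterator[str]:
    for line in lines:
        yield line

print(cleaned)
print(counts)
print(list(emit(cleaned)))
```

Each stage consumes the output of the previous one, which is the shape the full script follows.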

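Hands-On Playground

Step by Step: Breaking Down the Core Logic

Mentor's tip: the goal of this article is not to memorize if/for/comprehensions/yield by rote, but to string them together with one text-processing pipeline. First see how each small part runs, then see how the complete script connects them into a pipeline.

Step 1: Start with a batch of text containing dirty data

Core source (taken from the complete source at the end of the article):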

RAW_CORPUS: list[str] = [
    "  Python is a high-level programming language.  ",
    "",
    "Machine learning requires large datasets.",
    "PYTHON supports multiple programming paradigms.",
    "   ",
    "Deep learning is a subset of machine learning.",
    "Python has a simple and readable syntax.",
    "Data science uses Python extensively.",
    "Neural networks are inspired by the human brain.",
    "Python libraries like NumPy power data analysis.",
]

Runnable demo (with mock data and print feedback added):

RAW_CORPUS: list[str] = [
    "  Python is a high-level programming language.  ",
    "",
    "Machine learning requires large datasets.",
    "PYTHON supports multiple programming paradigms.",
    "   ",
    "Deep learning is a subset of machine learning.",
    "Python has a simple and readable syntax.",
    "Data science uses Python extensively.",
    "Neural networks are inspired by the human brain.",
    "Python libraries like NumPy power data analysis.",
]

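print(f"The raw text has {len(RAW_CORPUS)} lines")
print("The first 4 lines look like this:")
for line in RAW_CORPUS[:4]:
    print(repr(line))
print("Note: empty lines, stray whitespace, and inconsistent casing are mixed in on purpose.")

In plain terms: real text data rarely arrives clean. RAW_CORPUS is like a basket of vegetables just tipped onto the counter: there are usable sentences, but also empty lines, leading and trailing spaces, and inconsistent casing. The logic that follows washes, cuts, and packs that basket step by step.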
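
To see what kind of dirt a corpus contains before cleaning it, a plain for loop with if/elif is enough. The sketch below is illustrative only; the sample `lines` and the category names in `report` are assumptions, not part of the article's script.

```python
# Made-up sample with the same kinds of dirt as RAW_CORPUS.
lines = ["  Python is great.  ", "", "PYTHON rocks", "   ", "clean line"]

report = {"empty": 0, "blank": 0, "padded": 0, "ok": 0}
for line in lines:
    if line == "":
        report["empty"] += 1      # truly empty string
    elif line.strip() == "":
        report["blank"] += 1      # only whitespace
    elif line != line.strip():
        report["padded"] += 1     # real text with surrounding spaces
    else:
        report["ok"] += 1         # nothing to strip

print(report)
```

The order of the branches matters: an empty string would also satisfy the whitespace-only test, so the most specific check comes first.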

Step 2: Use match-case to label text by word count

Core source (taken from the complete source at the end of the article):

def classify_text(text: str) -> str:
    """Classify text with match-case (Python 3.10+)."""
    word_count = len(text.split())
    match word_count:
        case 0:
            return "empty"
        case 1 | 2:
            return "fragment"
        case n if n <= 8:
            return "sentence"
        case _:
            return "paragraph"

Runnable demo (with mock data and print feedback added):

def classify_text(text: str) -> str:
    """Classify text with match-case (Python 3.10+)."""
    word_count = len(text.split())
    match word_count:
        case 0:
            return "empty"
        case 1 | 2:
            return "fragment"
        case n if n <= 8:
            return "sentence"
        case _:
            return "paragraph"

samples = ["", "Python", "Python is great", "This sentence has many words for testing"]
for text in samples:
    print(f"{text!r:<48} => {classify_text(text)}")

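In plain terms: match-case works like a sorting table: 0 words go to "empty", 1 or 2 words go to "fragment", 3 to 8 words go to "sentence", and anything longer is a "paragraph". Cases are tried from top to bottom, so the guard in case n if n <= 8 only ever sees counts of 3 or more. Get comfortable with this routing logic before writing more complex conditions.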

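Step 3: Use a list comprehension for the three-step cleanup

Core source (taken from the complete source at the end of the article):

def clean_corpus(raw: list[str]) -> list[str]:
    """
    Three-step cleanup: drop empty lines → strip surrounding whitespace → lowercase.
    Equivalent for-loop version: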
        result = []
        for line in raw:
            line = line.strip()
            if line:
                result.append(line.lower())
    """
    return [line.strip().lower() for line in raw if line.strip()]

Runnable demo (with mock data and print feedback added):

def clean_corpus(raw: list[str]) -> list[str]:
    """
    Three-step cleanup: drop empty lines → strip surrounding whitespace → lowercase.
    Equivalent for-loop version:
        result = []
        for line in raw:
            line = line.strip()
            if line:
                result.append(line.lower())
    """
    return [line.strip().lower() for line in raw if line.strip()]

raw = ["  Python  ", "", " DATA Science ", "   "]
cleaned = clean_corpus(raw)
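print("Before:", raw)
print("After:", cleaned)
print("Three things in one pass: drop empty lines, strip whitespace, lowercase.")

In plain terms: a list comprehension can be read as a miniature assembly line: for line in raw picks up the lines one by one, if line.strip() throws away the blank ones, and line.strip().lower() cleans whatever is left. It is not showing off; it writes "filter + transform" as a sentence you can read at a glance.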
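
One detail of that comprehension is that line.strip() runs twice per line, once in the filter and once in the transform. A possible variant, assuming Python 3.8+ for the := operator, strips each line only once; the function name `clean_corpus_walrus` is made up for this sketch.

```python
def clean_corpus_walrus(raw: list[str]) -> list[str]:
    """Hypothetical variant of clean_corpus that strips each line once."""
    # The filter assigns the stripped text to `stripped`; blank results are
    # falsy and get dropped, the rest are lowercased.
    return [stripped.lower() for line in raw if (stripped := line.strip())]


raw = ["  Python  ", "", " DATA Science ", "   "]
print(clean_corpus_walrus(raw))
```

For a corpus this small the difference is negligible; the original two-call form is arguably easier to read, so this is a trade-off rather than a fix.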

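Step 4: Use an inverted index to look up which lines contain a keyword

Core source (taken from the complete source at the end of the article):

def build_word_index(corpus: list[str]) -> dict[str, list[int]]:
    """Build an inverted index mapping word → line numbers."""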
    index: dict[str, list[int]] = {}
    for line_no, line in enumerate(corpus):
        words = re.findall(r"\b[a-z]+\b", line)
        for word in words:
            index.setdefault(word, []).append(line_no)
    return index

Runnable demo (with mock data and print feedback added):

import re

def build_word_index(corpus: list[str]) -> dict[str, list[int]]:
    """Build an inverted index mapping word → line numbers."""
    index: dict[str, list[int]] = {}
    for line_no, line in enumerate(corpus):
        words = re.findall(r"\b[a-z]+\b", line)
        for word in words:
            index.setdefault(word, []).append(line_no)
    return index

corpus = ["python is readable", "data uses python", "learning uses data"]
index = build_word_index(corpus)
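print("Lines containing python:", index.get("python"))
print("Lines containing data:", index.get("data"))
print("Full index:", index)

In plain terms: an inverted index is like the keyword index at the back of a book: instead of asking which words are on line 2, you ask directly on which lines the word "python" appears. Search engines, knowledge-base retrieval, and log queries commonly rely on this kind of reverse lookup. The line numbers are 0-based positions in the list passed in.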
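
One behavior worth knowing: build_word_index appends a line number for every occurrence, so a word that appears twice on one line gets that line number twice. The sketch below shows this and a possible deduplicated variant; `build_unique_index` is a hypothetical name, and it finishes with an actual dict comprehension.

```python
import re


def build_word_index(corpus: list[str]) -> dict[str, list[int]]:
    """The article's version: one entry per occurrence."""
    index: dict[str, list[int]] = {}
    for line_no, line in enumerate(corpus):
        for word in re.findall(r"\b[a-z]+\b", line):
            index.setdefault(word, []).append(line_no)
    return index


def build_unique_index(corpus: list[str]) -> dict[str, list[int]]:
    """Hypothetical variant: each line number appears at most once per word."""
    seen: dict[str, set[int]] = {}
    for line_no, line in enumerate(corpus):
        for word in re.findall(r"\b[a-z]+\b", line):
            seen.setdefault(word, set()).add(line_no)
    # Dict comprehension: turn each set into a sorted list.
    return {word: sorted(lines) for word, lines in seen.items()}


corpus = ["deep learning is a subset of machine learning", "python is readable"]
print(build_word_index(corpus)["learning"])    # [0, 0]
print(build_unique_index(corpus)["learning"])  # [0]
print(build_unique_index(corpus)["is"])        # [0, 1]
```

Which form is preferable depends on the use: the duplicated form doubles as an occurrence count, the unique form reads better as "which lines".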

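Step 5: Use Counter to find high-frequency words

Core source (taken from the complete source at the end of the article):

def word_frequency(corpus: list[str]) -> dict[str, int]:
    """Count word frequencies (nested list comprehension + Counter) and return the top 10."""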
    all_words = [
        word
        for line in corpus
        for word in re.findall(r"\b[a-z]+\b", line)
    ]
    return dict(Counter(all_words).most_common(10))

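Runnable demo (with mock data and print feedback added):

import re
from collections import Counter

def word_frequency(corpus: list[str]) -> dict[str, int]:
    """Count word frequencies (nested list comprehension + Counter) and return the top 10."""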
    all_words = [
        word
        for line in corpus
        for word in re.findall(r"\b[a-z]+\b", line)
    ]
    return dict(Counter(all_words).most_common(10))

corpus = ["python data python", "data ai", "python ai"]
freq = word_frequency(corpus)
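print("Top word frequencies:")
for word, count in freq.items():
    print(f"{word:<8} => {count}")

In plain terms: Counter is like a vote teller: every time it sees a word, it gives that word one more vote. most_common(10) returns the ten words with the most votes, highest first; words with equal counts keep the order in which they were first seen. Once you can do word-frequency counts, you have passed through the first door of NLP.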
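
word_frequency itself relies on a list comprehension plus Counter. For comparison, here is what an actual dictionary comprehension over the counts might look like; the threshold of 3 and the variable names are arbitrary choices for this sketch.

```python
import re
from collections import Counter

corpus = ["python data python", "data ai", "python ai"]

# Generator expression feeding Counter: no intermediate word list is built.
counts = Counter(w for line in corpus for w in re.findall(r"\b[a-z]+\b", line))

# Dict comprehension with a filter: keep only words seen at least 3 times.
frequent = {word: n for word, n in counts.items() if n >= 3}

# Dict comprehension with a transform: map each word to its length.
lengths = {word: len(word) for word in counts}

print(frequent)  # {'python': 3}
print(lengths)   # {'python': 6, 'data': 4, 'ai': 2}
```

The pattern mirrors the list comprehension from Step 3: a key: value expression, a for clause, and an optional if filter.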

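Step 6: Use a generator to emit text in batches

Core source (taken from the complete source at the end of the article):

def stream_sentences(corpus: list[str], batch_size: int = 3) -> Generator[list[str], None, None]:
    """
    Generator: yields one batch at a time instead of building every batch up front.
    The corpus here is already an in-memory list; this models the batching pattern
    you would use on something like a 1,000,000-line text.
    """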
    batch: list[str] = []
    for line in corpus:
        batch.append(line)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

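Runnable demo (with mock data and print feedback added):

from typing import Generator

def stream_sentences(corpus: list[str], batch_size: int = 3) -> Generator[list[str], None, None]:
    """
    Generator: yields one batch at a time instead of building every batch up front.
    The corpus here is already an in-memory list; this models the batching pattern
    you would use on something like a 1,000,000-line text.
    """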
    batch: list[str] = []
    for line in corpus:
        batch.append(line)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

corpus = ["line-1", "line-2", "line-3", "line-4", "line-5"]
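print("Batches of 2:")
for batch in stream_sentences(corpus, batch_size=2):
    print(batch)

In plain terms: yield is like the pause button and bookmark on a video player: the function does not pour everything out at once, but pauses and hands you a batch each time one fills up. When the input itself is read lazily, as with a large file, this keeps memory use tied to the batch size rather than to the size of the whole file.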
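
To get the memory benefit on real data, the input has to be lazy too. The sketch below assumes a variant that accepts any iterable (for example an open file object) and uses itertools.islice to cut batches; `stream_batches` and `fake_file` are hypothetical names for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def stream_batches(lines: Iterable[str], batch_size: int = 3) -> Iterator[list[str]]:
    """Hypothetical variant of stream_sentences that accepts any iterable."""
    it = iter(lines)
    # islice pulls at most batch_size items; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        yield batch


def fake_file(n: int) -> Iterator[str]:
    """Stand-in for a large file: produces lines only on demand."""
    for i in range(1, n + 1):
        yield f"line-{i}"


batches = stream_batches(fake_file(5), batch_size=2)
print(next(batches))  # ['line-1', 'line-2']
print(list(batches))  # [['line-3', 'line-4'], ['line-5']]
```

After the first next() call only two lines have been produced; the rest are generated as the later batches are requested.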

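Geek Practice: Complete Source and Running It

Now put the building blocks above together: paste the complete code below into your editor and run it. Look at the whole loop first, then go back and change parameters section by section; that makes it easier to build engineering intuition.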


"""
Knowledge-base text batch-processing pipeline: demonstrates conditionals, loops, comprehensions, and generators.
Usage:
    python3 text_pipeline.py
    python3 text_pipeline.py --mode filter
    python3 text_pipeline.py --mode stats
    python3 text_pipeline.py --mode stream
"""

import argparse
import re
from collections import Counter
from typing import Generator

# ── Mock data (zero external dependencies) ───────────────────
RAW_CORPUS: list[str] = [
    "  Python is a high-level programming language.  ",
    "",
    "Machine learning requires large datasets.",
    "PYTHON supports multiple programming paradigms.",
    "   ",
    "Deep learning is a subset of machine learning.",
    "Python has a simple and readable syntax.",
    "Data science uses Python extensively.",
    "Neural networks are inspired by the human brain.",
    "Python libraries like NumPy power data analysis.",
]


# ── Conditionals ──────────────────────────────────────────────
def classify_text(text: str) -> str:
    """Classify text with match-case (Python 3.10+)."""
    word_count = len(text.split())
    match word_count:
        case 0:
            return "empty"
        case 1 | 2:
            return "fragment"
        case n if n <= 8:
            return "sentence"
        case _:
            return "paragraph"


# ── List comprehension ────────────────────────────────────────
def clean_corpus(raw: list[str]) -> list[str]:
    """
    Three-step cleanup: drop empty lines → strip surrounding whitespace → lowercase.
    Equivalent for-loop version:
        result = []
        for line in raw:
            line = line.strip()
            if line:
                result.append(line.lower())
    """
    return [line.strip().lower() for line in raw if line.strip()]


# ── Dictionary building (inverted index, word counts) ─────────
def build_word_index(corpus: list[str]) -> dict[str, list[int]]:
    """Build an inverted index mapping word → line numbers."""
    index: dict[str, list[int]] = {}
    for line_no, line in enumerate(corpus):
        words = re.findall(r"\b[a-z]+\b", line)
        for word in words:
            index.setdefault(word, []).append(line_no)
    return index


def word_frequency(corpus: list[str]) -> dict[str, int]:
    """Count word frequencies (nested list comprehension + Counter) and return the top 10."""
    all_words = [
        word
        for line in corpus
        for word in re.findall(r"\b[a-z]+\b", line)
    ]
    return dict(Counter(all_words).most_common(10))


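# ── Generator (lazy batching, suited to large inputs) ─────────
def stream_sentences(corpus: list[str], batch_size: int = 3) -> Generator[list[str], None, None]:
    """
    Generator: yields one batch at a time instead of building every batch up front.
    The corpus here is already an in-memory list; this models the batching pattern
    you would use on something like a 1,000,000-line text.
    """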
    batch: list[str] = []
    for line in corpus:
        batch.append(line)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# ── Output functions ──────────────────────────────────────────
def demo_filter(corpus: list[str]) -> None:
    """Demonstrate filtering and classification."""
    print("\n  ── Text filtering and classification ─────")
    print(f"  Raw lines: {len(RAW_CORPUS)}  After cleaning: {len(corpus)}")
    print()
    for i, line in enumerate(corpus):
        category = classify_text(line)
        print(f"  [{i:>2}] {category:<10} │ {line[:50]}")


def demo_stats(corpus: list[str]) -> None:
    """Demonstrate word-frequency statistics and the index."""
    freq = word_frequency(corpus)
    index = build_word_index(corpus)

    print("\n  ── Top 10 word frequencies ───────────────")
    for rank, (word, count) in enumerate(freq.items(), 1):
        bar = "█" * count
        print(f"  {rank:>2}. {word:<15} {count:>3}  {bar}")

    print("\n  ── Keyword index (python / learning) ─────")
    for kw in ("python", "learning"):
        lines = index.get(kw, [])
        print(f"  '{kw}' appears on lines {lines}")


def demo_stream(corpus: list[str]) -> None:
    """Demonstrate generator batching."""
    print("\n  ── Generator streaming (batch_size=3) ────")
    for batch_no, batch in enumerate(stream_sentences(corpus, batch_size=3), 1):
        print(f"\n  [Batch {batch_no}] {len(batch)} lines")
        for line in batch:
            print(f"    → {line[:60]}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Knowledge-base text batch-processing pipeline")
    parser.add_argument(
        "--mode",
        choices=["filter", "stats", "stream", "all"],
        default="all",
    )
    args = parser.parse_args()

    corpus = clean_corpus(RAW_CORPUS)

    if args.mode in ("filter", "all"):
        demo_filter(corpus)
    if args.mode in ("stats", "all"):
        demo_stats(corpus)
    if args.mode in ("stream", "all"):
        demo_stream(corpus)


if __name__ == "__main__":
    main()

Expected terminal output (--mode stats):

$ python3 text_pipeline.py --mode stats

  ── Top 10 word frequencies ───────────────
   1. python            5  █████
   2. a                 3  ███
   3. learning          3  ███
   4. is                2  ██
   5. programming       2  ██
   6. machine           2  ██
   7. data              2  ██
   8. high              1  █
   9. level             1  █
  10. language          1  █

  ── Keyword index (python / learning) ─────
  'python' appears on lines [0, 2, 4, 5, 7]
  'learning' appears on lines [1, 3, 3]

The line numbers are 0-based positions in the cleaned corpus. 'learning' lists line 3 twice because the word occurs twice in that sentence, and words with equal counts appear in the order they were first seen.

Comprehension Cheat Sheet

# List comprehension
squares = [x**2 for x in range(10) if x % 2 == 0]

# Dict comprehension
inv = {v: k for k, v in {"a": 1, "b": 2}.items()}

# Set comprehension (deduplicates automatically)
unique_lens = {len(w) for w in ["hi", "hello", "hey"]}

# Generator expression (lazy, saves memory)
total = sum(x**2 for x in range(1_000_000))

# Nested comprehension (flattening a matrix)
flat = [n for row in [[1,2],[3,4]] for n in row]

print("list comprehension squares:", squares)
print("dict comprehension inv:", inv)
print("set comprehension unique_lens:", sorted(unique_lens))
print("generator expression total:", total)
print("nested comprehension flat:", flat)
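
To make the "lazy, saves memory" remark concrete, one rough way to compare a list comprehension with a generator expression is sys.getsizeof. This is a sketch only: getsizeof reports the size of the container object itself, not of the integers it refers to, and the exact numbers vary by Python build.

```python
import sys

squares_list = [x**2 for x in range(100_000)]   # all values stored at once
squares_gen = (x**2 for x in range(100_000))    # values produced on demand

print("list object bytes:     ", sys.getsizeof(squares_list))
print("generator object bytes:", sys.getsizeof(squares_gen))

# Both produce the same values; the generator is consumed by this sum.
same_total = sum(squares_list) == sum(squares_gen)
print("same total:", same_total)

# A second pass over the generator finds nothing left.
leftover = sum(squares_gen)
print("second pass over generator:", leftover)
```

The generator object stays small no matter how large the range is, at the cost of being usable only once.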

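Pitfall Guide

Pitfall	Example	Correct approach
Modifying a list while iterating over it	`for x in lst: lst.remove(x)`	Iterate over a copy `lst[:]`, or build a new list with a comprehension
A generator can only be consumed once	A second `for` yields nothing	Use `list(gen)` when you need multiple passes
Comprehensions nested more than two levels deep	A three-level nested comprehension	Switch to ordinary loops; readability comes first
`range` excludes the stop value	`range(1,5)` → 1,2,3,4	Remember: closed on the left, open on the right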
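
The first, second, and fourth rows of the table can be reproduced in a few lines. The sketch below uses made-up sample values to show both the pitfall and the safer form.

```python
# Pitfall: removing items from the list being iterated skips elements.
nums = [1, 2, 3, 4]
for x in nums:
    nums.remove(x)
print(nums)  # [2, 4], not the empty list one might expect

# Safer: iterate over a copy, or build a new list with a comprehension.
values = [1, 2, 3, 4]
for x in values[:]:
    if x % 2 == 0:
        values.remove(x)
odds = [x for x in [1, 2, 3, 4] if x % 2 != 0]
print(values, odds)  # [1, 3] [1, 3]

# Pitfall: a generator is exhausted after one pass.
gen = (n * n for n in range(3))
first_pass = list(gen)
second_pass = list(gen)
print(first_pass, second_pass)  # [0, 1, 4] []

# range excludes its stop value.
print(list(range(1, 5)))  # [1, 2, 3, 4]
```

In the first loop the iterator advances by index while the list shrinks underneath it, which is why every other element survives.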

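Coming up next: with control flow in hand, handling more complex multi-dimensional data calls for dedicated containers. In the next article, "4 · Task Graph Basics: Lists, Dictionaries, and Sets", we will work systematically through Python's four main built-in data structures.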

NexDo Time ⚡

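5-minute geek micro-exercise: modify stream_sentences so that it accepts a filter_fn: Callable[[str], bool] parameter and only yields the lines that satisfy the condition. For example, filter_fn=lambda s: "python" in s streams only the lines containing "python".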
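
If you want something to compare your attempt against, here is one possible shape for the solution. It is a sketch, not the reference answer: making filter_fn optional with a default of None is an assumption beyond what the exercise states, and the sample corpus is made up.

```python
from typing import Callable, Iterator, Optional


def stream_sentences(
    corpus: list[str],
    batch_size: int = 3,
    filter_fn: Optional[Callable[[str], bool]] = None,  # None means "keep everything"
) -> Iterator[list[str]]:
    """Yield batches containing only the lines accepted by filter_fn."""
    batch: list[str] = []
    for line in corpus:
        if filter_fn is not None and not filter_fn(line):
            continue  # rejected lines never enter a batch
        batch.append(line)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


corpus = ["python is fun", "data science", "python rocks", "ai", "i like python"]
for batch in stream_sentences(corpus, batch_size=2, filter_fn=lambda s: "python" in s):
    print(batch)
```

Filtering before batching keeps every batch full except possibly the last one; filtering after batching would be simpler to write but would produce uneven batches.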

Don’t wait for next time, do it in the next moment.