
27 · Dynamic-Page Crawling: XHR Reverse Engineering and API Scraping

#032 · 2026-04-17 · Python

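🔗 Knowledge graph navigation: before reading this article, it helps to be comfortable with the crawler framework from 《25 · 网络爬虫:HTTP 抓取与数据提取》 (Web Crawling: HTTP Fetching and Data Extraction) and the Bearer Token mechanism from 《24 · JWT 鉴权》 (JWT Authentication). This article combines the two and shows how to reverse engineer an XHR request and reproduce it in Python.

Runtime: Python 3.12+ standard library only, no extra dependencies, runs as is. A local Mock server simulates the real API, so no external network access is needed.

Geek's take: modern websites load most of their data through XHR/Fetch endpoints rather than embedding it in the HTML. The crawler's core job therefore becomes reverse engineering the API: find the XHR request in the F12 Network panel, copy its request headers, and reproduce it in Python. This tends to be more robust than parsing HTML, because the field layout of a JSON API usually changes less often than a page's markup.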

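XHR reverse-engineering workflow

1. Open F12 → Network and filter by XHR/Fetch
2. Locate the data endpoint (usually a /api/... path that returns JSON)
3. Inspect the request headers: Authorization, Cookie, Referer, X-Requested-With
4. Reproduce the request with Python's urllib, sending the same headers
5. Parse the JSON response and extract the data
6. Work out the pagination parameters (page/size/offset/cursor) and loop over every page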

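Key request headers

Authorization: Bearer <token>       JWT authentication, the most common case
Cookie: session=xxx                 Session authentication
X-Requested-With: XMLHttpRequest    Marks the request as AJAX
Referer: https://example.com/       Originating page; some endpoints validate it
User-Agent: Mozilla/5.0 ...         Browser identification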
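
Headers copied from the F12 Network panel arrive as a block of "Name: value" lines, so a small parser saves retyping them into a dict. The following is a minimal sketch; the function name parse_raw_headers and the sample block RAW are assumptions made for illustration and are not part of the article's source.

```python
def parse_raw_headers(raw: str) -> dict[str, str]:
    """Turn a 'Name: value' block copied from DevTools into a headers dict."""
    headers: dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# Hypothetical block, shaped like what DevTools shows for a request.
RAW = """
:authority: example.com
Authorization: Bearer mock-token
Referer: https://example.com/
X-Requested-With: XMLHttpRequest
"""

headers = parse_raw_headers(RAW)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The resulting dict can be passed straight to urllib.request.Request(url, headers=headers).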

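Step by step: an adaptive breakdown of the core logic

The core of this article is three moves in XHR reverse engineering: find the endpoint → reproduce the request headers → crawl page by page. Each step below focuses on a single mechanism and uses a local Mock server in place of a real API, with no external network dependency.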

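Step 1: Use offline_products to understand the API response structure and Token authentication

Pain point and mechanism

offline_products simulates an API that requires Bearer Token authentication: without a Token it returns 401, with one it returns paginated data. The response envelope {"code": 0, "data": [...], "total": 5, "page": 1, "size": 3} is a layout many modern APIs use: code=0 means success, total is the overall item count, and data holds the items of the current page. Understanding this structure is the basis for paginated crawling.

Core source (verbatim from the full source at the end)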

def offline_products(page: int, size: int, token: str = "") -> dict:
    """端口不可用时的离线 Mock:模拟同样的 JSON 响应结构。"""
    if not token.startswith("Bearer "):
        return {"error": "Unauthorized", "code": 401}
    start = (page - 1) * size
    return {
        "code": 0,
        "data": MOCK_PRODUCTS[start:start + size],
        "total": len(MOCK_PRODUCTS),
        "page": page,
        "size": size,
    }

Runnable demo (Mock data and print feedback filled in)

MOCK_PRODUCTS = [
    {"id": 1001, "name": "机械键盘 Pro X", "price": 599.0, "stock": 42},
    {"id": 1002, "name": "4K 显示器 27寸", "price": 2199.0, "stock": 0},
    {"id": 1003, "name": "人体工学椅 E3", "price": 1899.0, "stock": 15},
    {"id": 1004, "name": "无线降噪耳机 Q45", "price": 899.0, "stock": 88},
    {"id": 1005, "name": "便携 SSD 1TB", "price": 459.0, "stock": 200},
]

def offline_products(page: int, size: int, token: str = "") -> dict:
    """端口不可用时的离线 Mock:模拟同样的 JSON 响应结构。"""
    if not token.startswith("Bearer "):
        return {"error": "Unauthorized", "code": 401}
    start = (page - 1) * size
    return {"code": 0, "data": MOCK_PRODUCTS[start:start + size], "total": len(MOCK_PRODUCTS), "page": page, "size": size}

print("无 Token:", offline_products(1, 3))
page1 = offline_products(1, 3, "Bearer mock-token")
print("第 1 页:", [p["name"] for p in page1["data"]], "total=", page1["total"])
page2 = offline_products(2, 3, "Bearer mock-token")
print("第 2 页:", [p["name"] for p in page2["data"]])
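
Because total and size both come back in the envelope, the number of pages can be derived before the crawl starts. The sketch below assumes the envelope shape shown above; the helper name page_count is hypothetical, and treating any non-zero code as an error is an assumption about this Mock API rather than a general rule.

```python
import math


def page_count(resp: dict) -> int:
    """Validate the envelope and derive how many pages the crawl needs."""
    # In this Mock API, code == 0 means success and anything else is an error.
    if resp.get("code") != 0:
        raise ValueError(f"API returned an error envelope: {resp}")
    # Round up so a partially filled last page is still requested.
    return math.ceil(resp["total"] / resp["size"])


sample = {"code": 0, "data": [], "total": 5, "page": 1, "size": 3}
print("pages needed:", page_count(sample))

try:
    page_count({"error": "Unauthorized", "code": 401})
except ValueError as exc:
    print("rejected:", exc)
```

Checking code before reading data also keeps a 401 envelope, which has no data or total field, from surfacing later as a KeyError.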

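Step 2: Use xhr_get to build XHR request headers and mimic a browser AJAX request

Pain point and mechanism

xhr_get passes the URL and a headers dict to urllib.request.Request and sends an HTTP request carrying those custom headers. X-Requested-With: XMLHttpRequest is the conventional marker of an AJAX request, and some APIs check it to tell browser requests apart from crawler requests. Authorization: Bearer <token> is the usual way to carry a JWT; copy its value from the request headers shown in the F12 Network panel.

Core source (verbatim from the full source at the end)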

def xhr_get(url: str, headers: dict[str, str]) -> dict:
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode())

Runnable demo (Mock data and print feedback filled in)

import json
import urllib.request

MOCK_PRODUCTS = [{"id": 1001, "name": "机械键盘 Pro X", "price": 599.0, "stock": 42}]

def xhr_get(url: str, headers: dict[str, str]) -> dict:
    # 真实源码会 urllib.request.urlopen(req)。教学演示只构造 Request,避免访问网络。
    req = urllib.request.Request(url, headers=headers)
    print("请求 URL:", req.full_url)
    print("Authorization:", req.headers.get("Authorization"))
    print("X-Requested-With:", req.headers.get("X-requested-with"))
    return {"code": 0, "data": MOCK_PRODUCTS, "total": len(MOCK_PRODUCTS), "page": 1, "size": 1}

headers = {
    "Authorization": "Bearer mock-token",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0",
}
resp = xhr_get("https://example.com/api/products?page=1&size=1", headers)
print("模拟响应:", json.dumps(resp, ensure_ascii=False))
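
One detail explains why the demo above reads the key "X-requested-with": urllib.request.Request normalizes each header name with str.capitalize() when storing it, so only the first letter stays uppercase. The short sketch below shows the effect; it builds a Request object without sending anything, and the URL is a placeholder.

```python
import urllib.request

req = urllib.request.Request(
    "https://example.com/api/products",
    headers={
        "X-Requested-With": "XMLHttpRequest",
        "Authorization": "Bearer mock-token",
    },
)

# Header names are stored capitalized: first letter upper, the rest lower.
stored_names = sorted(req.headers)
print("stored names:", stored_names)

# Looking up the original spelling in the raw dict therefore misses.
print("original spelling:", req.headers.get("X-Requested-With"))

# get_header() and has_header() expect the capitalized form as well.
print("capitalized spelling:", req.get_header("X-requested-with"))
print("has header:", req.has_header("X-requested-with"))
```

This only affects how the names are stored locally. HTTP header names are case-insensitive, so the server still recognizes the header.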

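Step 3: Use MockAPIHandler to build a local test server

Pain point and mechanism

MockAPIHandler subclasses BaseHTTPRequestHandler and implements a small but complete Mock API server: it checks the Authorization header and returns paginated data. The value of a local Mock server is that developing and testing a crawler needs no access to the real site, triggers no anti-crawling measures, and is not made flaky by network problems. start_mock_server launches the server in a separate thread so the main thread is not blocked.

Core source (verbatim from the full source at the end)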

class MockAPIHandler(BaseHTTPRequestHandler):
    def log_message(self, *args) -> None:
        pass  # 静默日志

    def do_GET(self) -> None:
        parsed = urllib.parse.urlparse(self.path)
        params = urllib.parse.parse_qs(parsed.query)

        if parsed.path == "/api/products":
            # 模拟需要 Authorization 头的 XHR 接口
            auth = self.headers.get("Authorization", "")
            if not auth.startswith("Bearer "):
                self._respond(401, {"error": "Unauthorized", "code": 401})
                return
            page = int(params.get("page", ["1"])[0])
            size = int(params.get("size", ["3"])[0])
            start = (page - 1) * size
            data = MOCK_PRODUCTS[start:start + size]
            self._respond(200, {
                "code": 0,
                "data": data,
                "total": len(MOCK_PRODUCTS),
                "page": page,
                "size": size,
            })
        else:
            self._respond(404, {"error": "Not Found"})

    def _respond(self, status: int, body: dict) -> None:
        payload = json.dumps(body, ensure_ascii=False).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

Runnable demo (Mock data and print feedback filled in)

import json
import urllib.parse

MOCK_PRODUCTS = [
    {"id": 1001, "name": "机械键盘 Pro X", "price": 599.0, "stock": 42},
    {"id": 1002, "name": "4K 显示器 27寸", "price": 2199.0, "stock": 0},
    {"id": 1003, "name": "人体工学椅 E3", "price": 1899.0, "stock": 15},
]

def mock_api_get(path: str, headers: dict[str, str]) -> tuple[int, dict]:
    parsed = urllib.parse.urlparse(path)
    params = urllib.parse.parse_qs(parsed.query)
    if parsed.path != "/api/products":
        return 404, {"error": "Not Found"}
    if not headers.get("Authorization", "").startswith("Bearer "):
        return 401, {"error": "Unauthorized", "code": 401}
    page = int(params.get("page", ["1"])[0])
    size = int(params.get("size", ["2"])[0])
    start = (page - 1) * size
    return 200, {"code": 0, "data": MOCK_PRODUCTS[start:start + size], "total": len(MOCK_PRODUCTS), "page": page, "size": size}

for headers in [{}, {"Authorization": "Bearer mock-token"}]:
    status, body = mock_api_get("/api/products?page=1&size=2", headers)
    print(f"headers={list(headers.keys()) or '无'} -> HTTP {status}, body={json.dumps(body, ensure_ascii=False)}")
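
The demo above exercises the routing logic without opening a socket. To see the thread-based startup described in this step, here is a minimal sketch under a few assumptions: TinyHandler, probe and run_demo are hypothetical names, the handler is a cut-down stand-in for MockAPIHandler, and port 0 (letting the OS pick a free port) replaces the article's fixed port 18765. If the environment forbids binding a local port, run_demo returns None instead of failing.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class TinyHandler(BaseHTTPRequestHandler):
    """Cut-down stand-in for MockAPIHandler: only checks the Bearer header."""

    def log_message(self, *args) -> None:
        pass  # keep the console quiet

    def do_GET(self) -> None:
        ok = self.headers.get("Authorization", "").startswith("Bearer ")
        payload = json.dumps({"code": 0 if ok else 401}).encode()
        self.send_response(200 if ok else 401)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# An opener with an empty ProxyHandler ignores any system proxy settings.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, headers: dict[str, str]) -> int:
    """Return the HTTP status code for a GET request."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with OPENER.open(req, timeout=3) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def run_demo():
    try:
        server = HTTPServer(("127.0.0.1", 0), TinyHandler)  # port 0: OS picks one
    except OSError:
        return None  # local sockets are not allowed here
    # serve_forever blocks, so it runs in a daemon thread.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/api/products"
        return probe(url, {}), probe(url, {"Authorization": "Bearer mock-token"})
    finally:
        server.shutdown()
        server.server_close()


statuses = run_demo()
print("status without / with token:", statuses)
```

Letting the OS choose the port avoids the "port already in use" case that the fixed port has to handle with an offline fallback, at the cost of having to read the actual port back from server.server_address.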

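Step 4: Use mode_headers to show the strategy for forging key request headers

Pain point and mechanism

mode_headers prints the list of key request headers a crawler needs to forge and explains what each one does. This is the intelligence-gathering phase of XHR reverse engineering: find the target request in the F12 Network panel, copy all of its request headers, and work out one by one which are required and which are optional. Typically Authorization/Cookie are required for authentication, while User-Agent/Referer are what anti-crawling checks look at.

Core source (verbatim from the full source at the end)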

def mode_headers(_: argparse.Namespace) -> None:
    print(f"=== 请求头伪造关键字段  [{now_str()}] ===\n")
    rows = [
        ["User-Agent", "浏览器标识", "必须", "复制真实浏览器 UA"],
        ["Referer", "来源页面", "常见", "填写目标站首页或列表页"],
        ["Authorization", "鉴权 Token", "视接口", "从 F12 Network 中复制"],
        ["Cookie", "会话凭证", "登录态必须", "Session 保持或手动复制"],
        ["Accept", "期望响应类型", "建议", "application/json"],
        ["Accept-Language", "语言偏好", "可选", "zh-CN,zh;q=0.9"],
        ["X-Requested-With", "Ajax 标识", "部分站点", "XMLHttpRequest"],
    ]
    print(ascii_table(["请求头", "作用", "重要性", "处理方式"], rows, title="XHR 逆向关键请求头"))

Runnable demo (Mock data and print feedback filled in)

from datetime import datetime
from zoneinfo import ZoneInfo

TZ = ZoneInfo("Asia/Shanghai")

def now_str() -> str:
    return datetime.now(TZ).strftime("%Y-%m-%d %H:%M:%S")

def mode_headers(_) -> None:
    print(f"=== 请求头伪造关键字段  [{now_str()}] ===")
    rows = [
        ("User-Agent", "浏览器标识", "复制真实浏览器 UA"),
        ("Referer", "来源页面", "填写列表页地址"),
        ("Authorization", "鉴权 Token", "从 F12 Network 复制"),
        ("Cookie", "会话凭证", "登录态接口常见"),
        ("X-Requested-With", "Ajax 标识", "XMLHttpRequest"),
    ]
    for name, role, action in rows:
        print(f"{name:<18} | {role:<8} | {action}")

mode_headers(None)
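
Working out which headers are required can be done mechanically: drop one header at a time and check whether the request still succeeds. The sketch below illustrates that idea under stated assumptions. required_headers and fake_send are hypothetical names, and fake_send is an offline stand-in that only checks the Bearer prefix, the way the Mock API in this article does. Against a real site, the send callable would issue an actual request.

```python
from typing import Callable


def required_headers(
    headers: dict[str, str],
    send: Callable[[dict[str, str]], int],
) -> list[str]:
    """Return the headers whose removal makes the request stop returning 200."""
    needed = []
    for name in headers:
        # Send everything except the header under test.
        trimmed = {k: v for k, v in headers.items() if k != name}
        if send(trimmed) != 200:
            needed.append(name)
    return needed


def fake_send(headers: dict[str, str]) -> int:
    # Offline stand-in: mimics the Mock API, which only checks Authorization.
    return 200 if headers.get("Authorization", "").startswith("Bearer ") else 401


captured = {
    "Authorization": "Bearer mock-token",
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/",
    "X-Requested-With": "XMLHttpRequest",
}
needed = required_headers(captured, fake_send)
print("required:", needed)
print("optional:", [name for name in captured if name not in needed])
```

Testing one header at a time misses cases where a server accepts either of two headers, so the result is a starting point rather than a proof.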

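Step 5: Use mode_xhr to walk through the complete XHR reverse-engineering crawl

Pain point and mechanism

mode_xhr demonstrates the complete flow: first send a request without a Token (simulating a first attempt), see the 401, resend with the Token, then loop over the pages to collect all the data. The offline fallback lets the demo run even in environments without network permissions. That is the value of Mock data: the logic is identical and only the data source differs.

Core source (verbatim from the full source at the end)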

def mode_xhr(_: argparse.Namespace) -> None:
    print(f"=== XHR 接口逆向演示  [{now_str()}] ===\n")
    PORT = 18765
    server = None
    offline = False
    try:
        server = start_mock_server(PORT)
        time.sleep(0.1)
    except OSError as exc:
        offline = True
        print(f"[降级] 当前环境不允许启动本地 HTTPServer:{exc}")
        print("[降级] 改用离线 Mock JSON,保持 XHR 鉴权与分页逻辑一致。\n")

    base = f"http://127.0.0.1:{PORT}"

    # Step 1: 无 Token 请求(模拟未分析到鉴权头)
    print("Step 1: 未携带 Authorization 头(模拟初次尝试)")
    if offline:
        print(f"  → HTTP {offline_products(1, 3)['code']},需要鉴权\n")
    else:
        try:
            req = urllib.request.Request(f"{base}/api/products")
            urllib.request.urlopen(req, timeout=3)
        except urllib.error.HTTPError as e:
            print(f"  → HTTP {e.code},需要鉴权\n")

    # Step 2: 携带 Token(模拟从 F12 抓到 Authorization 头后复现)
    print("Step 2: 携带 Authorization: Bearer <token>(复现 XHR 请求)")
    headers = {
        "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.mock",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
        "Referer": f"{base}/",
        "Accept": "application/json",
    }

    all_items = []
    for page in range(1, 3):
        url = f"{base}/api/products?page={page}&size=3"
        resp = offline_products(page, 3, headers["Authorization"]) if offline else xhr_get(url, headers)
        items = resp.get("data", [])
        all_items.extend(items)
        print(f"  第 {page} 页: 获取 {len(items)} 条,total={resp['total']}")
        if len(all_items) >= resp["total"]:
            break

    print()
    rows = [[p["id"], p["name"], f"¥{p['price']:.2f}", "有货" if p["stock"] > 0 else "缺货"]
            for p in all_items]
    print(ascii_table(["ID", "商品名称", "价格", "库存"], rows, title="XHR 接口抓取结果"))
    if server:
        server.shutdown()

Runnable demo (Mock data and print feedback filled in)

MOCK_PRODUCTS = [
    {"id": 1001, "name": "机械键盘 Pro X", "price": 599.0, "stock": 42},
    {"id": 1002, "name": "4K 显示器 27寸", "price": 2199.0, "stock": 0},
    {"id": 1003, "name": "人体工学椅 E3", "price": 1899.0, "stock": 15},
    {"id": 1004, "name": "无线降噪耳机 Q45", "price": 899.0, "stock": 88},
    {"id": 1005, "name": "便携 SSD 1TB", "price": 459.0, "stock": 200},
]

def offline_products(page: int, size: int, token: str = "") -> dict:
    if not token.startswith("Bearer "):
        return {"error": "Unauthorized", "code": 401}
    start = (page - 1) * size
    return {"code": 0, "data": MOCK_PRODUCTS[start:start + size], "total": len(MOCK_PRODUCTS), "page": page, "size": size}

print("Step 1: 不带 Authorization")
print("  ->", offline_products(1, 3))
print("Step 2: 带 Authorization 后分页抓取")
all_items = []
for page in range(1, 4):
    resp = offline_products(page, 2, "Bearer mock-token")
    all_items.extend(resp["data"])
    print(f"  第 {page} 页: {len(resp['data'])} 条,累计 {len(all_items)}/{resp['total']}")
    if len(all_items) >= resp["total"]:
        break
for item in all_items:
    print(f"  {item['id']} {item['name']} ¥{item['price']}")

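Step 6: Use mode_mock to describe the Mock API's structure

Pain point and mechanism

mode_mock prints the Mock API's interface documentation: URL format, required request headers, and response structure. This self-documenting design lets readers understand how to use the API without reading the source. In real projects such documentation usually comes from Swagger/OpenAPI; when it is published, crawler developers can learn the interface format directly from the API docs without any reverse engineering.

Core source (verbatim from the full source at the end)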

def mode_mock(_: argparse.Namespace) -> None:
    """展示 Mock 服务器的 API 结构"""
    print(f"=== Mock API 结构说明  [{now_str()}] ===\n")
    rows = [
        ["GET /api/products", "Bearer Token", "page, size", "商品列表 JSON"],
        ["GET /api/products", "无", "—", "401 Unauthorized"],
    ]
    print(ascii_table(["接口", "鉴权", "参数", "响应"], rows, title="Mock XHR API"))

Runnable demo (Mock data and print feedback filled in)

def mode_mock(_) -> None:
    print("=== Mock API 结构说明 ===")
    rows = [
        ("GET /api/products", "Bearer Token", "page,size", "商品列表 JSON"),
        ("GET /api/products", "无", "-", "401 Unauthorized"),
    ]
    for api, auth, params, resp in rows:
        print(f"接口={api} | 鉴权={auth} | 参数={params} | 响应={resp}")

mode_mock(None)

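Step 7: Crawl all the data with a pagination loop

Pain point and mechanism

The standard pattern for paginated crawling is a while True loop: request one page per iteration, append the data to all_items, and exit once len(all_items) >= total. The total field is the key: it tells the crawler how many items exist in all, so the loop does not have to rely on "an empty list came back" to detect the end. Some APIs paginate with a cursor instead of page; the loop is the same, except that each request sends the cursor returned by the previous response instead of a page number.

Core source (verbatim from the full source at the end)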

def offline_products(page: int, size: int, token: str = "") -> dict:
    """端口不可用时的离线 Mock:模拟同样的 JSON 响应结构。"""
    if not token.startswith("Bearer "):
        return {"error": "Unauthorized", "code": 401}
    start = (page - 1) * size
    return {
        "code": 0,
        "data": MOCK_PRODUCTS[start:start + size],
        "total": len(MOCK_PRODUCTS),
        "page": page,
        "size": size,
    }

Runnable demo (Mock data and print feedback filled in)

MOCK_PRODUCTS = [
    {"id": 1001, "name": "机械键盘 Pro X"},
    {"id": 1002, "name": "4K 显示器 27寸"},
    {"id": 1003, "name": "人体工学椅 E3"},
    {"id": 1004, "name": "无线降噪耳机 Q45"},
    {"id": 1005, "name": "便携 SSD 1TB"},
]

def offline_products(page: int, size: int, token: str = "") -> dict:
    if not token.startswith("Bearer "):
        return {"error": "Unauthorized", "code": 401}
    start = (page - 1) * size
    return {"code": 0, "data": MOCK_PRODUCTS[start:start + size], "total": len(MOCK_PRODUCTS), "page": page, "size": size}

all_items = []
page = 1
while True:
    resp = offline_products(page, 2, "Bearer mock-token")
    all_items.extend(resp["data"])
    print(f"第 {page} 页: {len(resp['data'])} 条 (累计 {len(all_items)} 条)")
    if len(all_items) >= resp["total"]:
        break
    page += 1
print("全部商品:", [item["name"] for item in all_items])
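
For comparison, here is how the same loop might look against a cursor-paginated API. This is a sketch under assumptions: cursor_page is a hypothetical offline Mock, the field name next_cursor and the use of a list offset as the cursor are invented for illustration, and real APIs often return an opaque string instead.

```python
from typing import Optional

ITEMS = [f"item-{i}" for i in range(1, 6)]


def cursor_page(cursor: Optional[int], limit: int) -> dict:
    """Hypothetical cursor API: the cursor is simply the next list offset."""
    start = cursor or 0
    end = start + limit
    # No next_cursor means the client has reached the last page.
    next_cursor = end if end < len(ITEMS) else None
    return {"code": 0, "data": ITEMS[start:end], "next_cursor": next_cursor}


def fetch_all(limit: int = 2) -> list[str]:
    items: list[str] = []
    cursor: Optional[int] = None  # the first request carries no cursor
    while True:
        resp = cursor_page(cursor, limit)
        items.extend(resp["data"])
        # Feed the returned cursor into the next request.
        cursor = resp["next_cursor"]
        if cursor is None:
            break
    return items


all_items = fetch_all(limit=2)
print(f"fetched {len(all_items)} items:", all_items)
```

The exit condition is the main difference: a cursor API signals the end by omitting the next cursor, so the loop does not need a total field.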

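Step 8: Use main as the CLI entry point for the xhr/mock/headers modes

Pain point and mechanism

main uses argparse as the CLI entry point, and the three modes map to three levels of learning: headers for request-header analysis, mock for the API structure, and xhr for the complete reverse-engineering flow. mode_xhr has a built-in fallback: if the local port is already in use, it switches automatically to the offline Mock data so the demo can run in any environment.

Core source (verbatim from the full source at the end)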

def main() -> None:
    p = argparse.ArgumentParser(description="XHR 逆向与动态页面演示")
    p.add_argument("--mode", choices=["xhr", "mock", "headers"], default="xhr")
    args = p.parse_args()
    {"xhr": mode_xhr, "mock": mode_mock, "headers": mode_headers}[args.mode](args)

Runnable demo (Mock data and print feedback filled in)

import argparse
import sys

def mode_xhr(args):
    print("xhr 模式: 模拟无 Token -> 401,再带 Bearer Token 分页抓取")
def mode_mock(args):
    print("mock 模式: 查看 Mock API 的路径、鉴权和响应结构")
def mode_headers(args):
    print("headers 模式: 查看 XHR 逆向常见请求头")

def main() -> None:
    parser = argparse.ArgumentParser(description="XHR 逆向与 API 抓取演示")
    parser.add_argument("--mode", choices=["xhr", "mock", "headers"], default="xhr")
    args = parser.parse_args()
    if args.mode == "mock":
        mode_mock(args)
    elif args.mode == "headers":
        mode_headers(args)
    else:
        mode_xhr(args)

for mode in ["headers", "mock", "xhr"]:
    sys.argv = ["prog", "--mode", mode]
    print(f">>> python3 27-python-spider-dynamic.py --mode {mode}")
    main()

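Geek in practice: full source and how to run it

Now put the building blocks above together: paste the complete code below into your editor and run it. Look at the whole loop first, then go back and tweak parameters section by section; that makes it easier to build engineering intuition.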

#!/usr/bin/env python3
"""
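27-python-spider-dynamic.py: XHR reverse engineering and dynamic page demo (no external dependencies)

Usage:
  python3 27-python-spider-dynamic.py --mode xhr      # simulate calling an XHR endpoint
  python3 27-python-spider-dynamic.py --mode mock     # print the Mock API structure
  python3 27-python-spider-dynamic.py --mode headers  # list the key request headers to forge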
"""

import argparse
import json
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any
from zoneinfo import ZoneInfo

TZ = ZoneInfo("Asia/Shanghai")

def now_str() -> str:
    return datetime.now(TZ).strftime("%Y-%m-%d %H:%M:%S")

def ascii_table(headers: list[str], rows: list[list[Any]], title: str = "") -> str:
    col_w = [len(h) for h in headers]
    for row in rows:
        for i, cell in enumerate(row):
            col_w[i] = max(col_w[i], len(str(cell)))
    sep = "+" + "+".join("-" * (w + 2) for w in col_w) + "+"
    fmt = "|" + "|".join(f" {{:<{w}}} " for w in col_w) + "|"
    lines = []
    if title:
        total = sum(col_w) + 3 * len(col_w) + 1
        lines += [sep, f"|{title.center(total - 2)}|"]
    lines += [sep, fmt.format(*headers), sep]
    for row in rows:
        lines.append(fmt.format(*[str(c) for c in row]))
    lines.append(sep)
    return "\n".join(lines)


MOCK_PRODUCTS = [
    {"id": 1001, "name": "机械键盘 Pro X", "price": 599.0, "stock": 42},
    {"id": 1002, "name": "4K 显示器 27寸", "price": 2199.0, "stock": 0},
    {"id": 1003, "name": "人体工学椅 E3", "price": 1899.0, "stock": 15},
    {"id": 1004, "name": "无线降噪耳机 Q45", "price": 899.0, "stock": 88},
    {"id": 1005, "name": "便携 SSD 1TB", "price": 459.0, "stock": 200},
]

class MockAPIHandler(BaseHTTPRequestHandler):
    def log_message(self, *args) -> None:
        pass  # 静默日志

    def do_GET(self) -> None:
        parsed = urllib.parse.urlparse(self.path)
        params = urllib.parse.parse_qs(parsed.query)

        if parsed.path == "/api/products":
            # 模拟需要 Authorization 头的 XHR 接口
            auth = self.headers.get("Authorization", "")
            if not auth.startswith("Bearer "):
                self._respond(401, {"error": "Unauthorized", "code": 401})
                return
            page = int(params.get("page", ["1"])[0])
            size = int(params.get("size", ["3"])[0])
            start = (page - 1) * size
            data = MOCK_PRODUCTS[start:start + size]
            self._respond(200, {
                "code": 0,
                "data": data,
                "total": len(MOCK_PRODUCTS),
                "page": page,
                "size": size,
            })
        else:
            self._respond(404, {"error": "Not Found"})

    def _respond(self, status: int, body: dict) -> None:
        payload = json.dumps(body, ensure_ascii=False).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def start_mock_server(port: int = 18765) -> HTTPServer:
    server = HTTPServer(("127.0.0.1", port), MockAPIHandler)
    t = threading.Thread(target=server.serve_forever, daemon=True)
    t.start()
    return server

# ── XHR 客户端 ───────────────────────────────────────────────
def offline_products(page: int, size: int, token: str = "") -> dict:
    """端口不可用时的离线 Mock:模拟同样的 JSON 响应结构。"""
    if not token.startswith("Bearer "):
        return {"error": "Unauthorized", "code": 401}
    start = (page - 1) * size
    return {
        "code": 0,
        "data": MOCK_PRODUCTS[start:start + size],
        "total": len(MOCK_PRODUCTS),
        "page": page,
        "size": size,
    }


def xhr_get(url: str, headers: dict[str, str]) -> dict:
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode())

def mode_xhr(_: argparse.Namespace) -> None:
    print(f"=== XHR 接口逆向演示  [{now_str()}] ===\n")
    PORT = 18765
    server = None
    offline = False
    try:
        server = start_mock_server(PORT)
        time.sleep(0.1)
    except OSError as exc:
        offline = True
        print(f"[降级] 当前环境不允许启动本地 HTTPServer:{exc}")
        print("[降级] 改用离线 Mock JSON,保持 XHR 鉴权与分页逻辑一致。\n")

    base = f"http://127.0.0.1:{PORT}"

    # Step 1: 无 Token 请求(模拟未分析到鉴权头)
    print("Step 1: 未携带 Authorization 头(模拟初次尝试)")
    if offline:
        print(f"  → HTTP {offline_products(1, 3)['code']},需要鉴权\n")
    else:
        try:
            req = urllib.request.Request(f"{base}/api/products")
            urllib.request.urlopen(req, timeout=3)
        except urllib.error.HTTPError as e:
            print(f"  → HTTP {e.code},需要鉴权\n")

    # Step 2: 携带 Token(模拟从 F12 抓到 Authorization 头后复现)
    print("Step 2: 携带 Authorization: Bearer <token>(复现 XHR 请求)")
    headers = {
        "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.mock",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
        "Referer": f"{base}/",
        "Accept": "application/json",
    }

    all_items = []
    for page in range(1, 3):
        url = f"{base}/api/products?page={page}&size=3"
        resp = offline_products(page, 3, headers["Authorization"]) if offline else xhr_get(url, headers)
        items = resp.get("data", [])
        all_items.extend(items)
        print(f"  第 {page} 页: 获取 {len(items)} 条,total={resp['total']}")
        if len(all_items) >= resp["total"]:
            break

    print()
    rows = [[p["id"], p["name"], f"¥{p['price']:.2f}", "有货" if p["stock"] > 0 else "缺货"]
            for p in all_items]
    print(ascii_table(["ID", "商品名称", "价格", "库存"], rows, title="XHR 接口抓取结果"))
    if server:
        server.shutdown()

def mode_mock(_: argparse.Namespace) -> None:
    """展示 Mock 服务器的 API 结构"""
    print(f"=== Mock API 结构说明  [{now_str()}] ===\n")
    rows = [
        ["GET /api/products", "Bearer Token", "page, size", "商品列表 JSON"],
        ["GET /api/products", "无", "—", "401 Unauthorized"],
    ]
    print(ascii_table(["接口", "鉴权", "参数", "响应"], rows, title="Mock XHR API"))

def mode_headers(_: argparse.Namespace) -> None:
    print(f"=== 请求头伪造关键字段  [{now_str()}] ===\n")
    rows = [
        ["User-Agent", "浏览器标识", "必须", "复制真实浏览器 UA"],
        ["Referer", "来源页面", "常见", "填写目标站首页或列表页"],
        ["Authorization", "鉴权 Token", "视接口", "从 F12 Network 中复制"],
        ["Cookie", "会话凭证", "登录态必须", "Session 保持或手动复制"],
        ["Accept", "期望响应类型", "建议", "application/json"],
        ["Accept-Language", "语言偏好", "可选", "zh-CN,zh;q=0.9"],
        ["X-Requested-With", "Ajax 标识", "部分站点", "XMLHttpRequest"],
    ]
    print(ascii_table(["请求头", "作用", "重要性", "处理方式"], rows, title="XHR 逆向关键请求头"))

def main() -> None:
    p = argparse.ArgumentParser(description="XHR 逆向与动态页面演示")
    p.add_argument("--mode", choices=["xhr", "mock", "headers"], default="xhr")
    args = p.parse_args()
    {"xhr": mode_xhr, "mock": mode_mock, "headers": mode_headers}[args.mode](args)

if __name__ == "__main__":
    main()
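
$ python3 27-python-spider-dynamic.py --mode headers

=== 请求头伪造关键字段  [2026-04-18 04:46:14] ===

+------------------+----------+-------+-------------------+
|                       XHR 逆向关键请求头                       |
+------------------+----------+-------+-------------------+
| 请求头              | 作用       | 重要性   | 处理方式              |
+------------------+----------+-------+-------------------+
| User-Agent       | 浏览器标识    | 必须    | 复制真实浏览器 UA        |
| Referer          | 来源页面     | 常见    | 填写目标站首页或列表页       |
| Authorization    | 鉴权 Token | 视接口   | 从 F12 Network 中复制 |
| Cookie           | 会话凭证     | 登录态必须 | Session 保持或手动复制   |
| Accept           | 期望响应类型   | 建议    | application/json  |
| Accept-Language  | 语言偏好     | 可选    | zh-CN,zh;q=0.9    |
| X-Requested-With | Ajax 标识  | 部分站点  | XMLHttpRequest    |
+------------------+----------+-------+-------------------+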

$ python3 27-python-spider-dynamic.py --mode xhr

=== XHR 接口逆向演示  [2026-04-18 04:46:14] ===

Step 1: 未携带 Authorization 头(模拟初次尝试)
  → HTTP 401,需要鉴权

Step 2: 携带 Authorization: Bearer <token>(复现 XHR 请求)
  第 1 页: 获取 3 条,total=5
  第 2 页: 获取 2 条,total=5

+------+------------+----------+----+
|             XHR 接口抓取结果            |
+------+------------+----------+----+
| ID   | 商品名称       | 价格       | 库存 |
+------+------------+----------+----+
| 1001 | 机械键盘 Pro X | ¥599.00  | 有货 |
| 1002 | 4K 显示器 27寸 | ¥2199.00 | 缺货 |
...
+------+------------+----------+----+

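Summary

Concept                  One-line reminder
XHR reverse engineering  F12 → Network → XHR → copy request headers → reproduce in Python
Authorization: Bearer    JWT authentication; copy it from the request headers in F12
X-Requested-With         AJAX marker header; some APIs validate it
Paginated crawling       Loop while len(all) < total, one page per request
offline_products         Mock-data fallback so the demo runs in any environment
MockAPIHandler           Local test server; crawler development never triggers real anti-crawling
start_mock_server        Starts in a separate thread without blocking the main thread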

⏱ NexDo Time (5 minutes)

Challenge: add a search endpoint GET /api/products/search?q=<keyword> to MockAPIHandler that supports fuzzy search by product name.

Steps:

  1. Add a branch in do_GET that checks parsed.path == "/api/products/search"
  2. Read the q parameter from params and do a case-insensitive substring match against MOCK_PRODUCTS
  3. Return data in the form {"code": 0, "data": [...], "total": len(results)}
  4. Verify with xhr_get("http://127.0.0.1:PORT/api/products/search?q=Mac", headers)
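
As a starting point for steps 2 to 4, the matching logic and the query-string encoding can be sketched as pure functions before wiring them into do_GET. The names search_products and search_url are hypothetical helpers, not part of the article's source, and the two-item SAMPLE list is a shortened copy of MOCK_PRODUCTS.

```python
import urllib.parse

SAMPLE = [
    {"id": 1001, "name": "机械键盘 Pro X"},
    {"id": 1005, "name": "便携 SSD 1TB"},
]


def search_products(products: list[dict], q: str) -> dict:
    """Case-insensitive substring match on the product name."""
    needle = q.casefold()
    results = [p for p in products if needle in p["name"].casefold()]
    return {"code": 0, "data": results, "total": len(results)}


def search_url(base: str, q: str) -> str:
    """Build the search URL; urlencode percent-encodes spaces and non-ASCII text."""
    return f"{base}/api/products/search?" + urllib.parse.urlencode({"q": q})


hit = search_products(SAMPLE, "ssd")
print(hit["total"], [p["name"] for p in hit["data"]])
print(search_url("http://127.0.0.1:18765", "pro x"))
```

With the MOCK_PRODUCTS defined in this article, q=Mac matches no product name, so a keyword such as SSD or Pro gives a more visible check in step 4.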

Don’t wait for next time, do it in the next moment.