Combining Dynamic and Static Approaches: Exploring How to Scrape Mobile News Data

Article link: https://blog.csdn.net/ip16yun/article/details/148711385

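Now that mobile devices have become the main reading channel for most people, news apps keep iterating on everything from interface polish to recommendation algorithms, and the data structures behind them keep getting more complex. Take Toutiao (今日头条) as an example: it not only serves a news feed but also pushes content based on user behavior, and most of that data comes from dynamic API endpoints rather than traditional static HTML pages.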

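This creates new challenges for data collection. How do you fetch the news list quickly and also reach deeper data such as comments? Relying on a single technique is no longer enough.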

Working on Several Fronts: Strategies for Handling Differences in App Structure

To handle app data with complex structures and many kinds of endpoints, you usually work along several lines:

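Interface analysis: the data shown on mobile and on the web is not exactly the same. Compare Toutiao's H5 news pages with the app's internal structure to find out which data layer is easier to tap into.
Request simulation: content such as news titles and summaries can be fetched from the H5 side with plain static requests, while the comment section usually requires simulating the client and calling a dynamic API.
Identity disguise: requests should carry realistic client information, such as a User-Agent, cookies, and a Referer, so they are not flagged as bot traffic.
IP rotation: calling the same app endpoint too often easily triggers rate limiting, so routing traffic through a proxy service is standard practice. This article uses a crawler proxy tunnel.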
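
The identity disguise and IP rotation points above can be sketched as a small helper that hands out a fresh header and proxy combination for each request. This is only an illustration: `make_profile_picker`, the example User-Agent strings, and the `proxy-a.example.com` / `proxy-b.example.com` addresses are all hypothetical. A tunnel-style proxy like the one used later in this article exposes a single endpoint and rotates the exit IP on its side, so a local pool like this is only relevant if you manage several proxy endpoints yourself.

```python
import itertools
import random

# Hypothetical example pools. Real values would come from your own
# device captures and from your proxy provider.
EXAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
]
EXAMPLE_PROXY_URLS = [
    "http://user:pass@proxy-a.example.com:31000",
    "http://user:pass@proxy-b.example.com:31000",
]


def make_profile_picker(user_agents, proxy_urls, referer="https://www.toutiao.com/"):
    """Return a function that yields (headers, proxies) for the next request.

    Proxies are used round-robin; the User-Agent is picked at random.
    """
    proxy_cycle = itertools.cycle(proxy_urls)

    def next_profile():
        proxy = next(proxy_cycle)
        headers = {
            "User-Agent": random.choice(user_agents),
            "Referer": referer,
        }
        # The same proxy URL handles both plain and TLS traffic.
        proxies = {"http": proxy, "https": proxy}
        return headers, proxies

    return next_profile
```

Each call to the returned function produces a pair that can be passed directly as the `headers=` and `proxies=` arguments of `requests.get`.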

Hands-On: Toutiao's "Today's Top News" as an Example

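We start from the H5 endpoint and fetch the day's trending headlines first. The fields this endpoint returns are relatively clean, and it can be accessed without logging in. The comments require constructing a mobile API path. It takes quite a few parameters, but after some observation the format can be roughly reconstructed.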
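
As a rough sketch of what "reconstructing the format" might look like, the helpers below assemble the comment URL from a `group_id` and try to pull a candidate ID out of an article URL. Only the endpoint path and the `group_id` and `count` parameters come from this article. The helper names and the assumption that the ID appears in the article URL as a long run of digits are illustrative guesses that would need to be checked against real captured traffic.

```python
import re
from urllib.parse import urlencode

# Endpoint path as used in this article's script.
COMMENT_ENDPOINT = "https://www.toutiao.com/article/v2/tab_comments/"


def extract_group_id(article_url):
    """Return the first run of 10+ digits in the URL, or None.

    Assumption: the article ID is embedded in the URL as a long number.
    Verify this against captured requests before relying on it.
    """
    match = re.search(r"\d{10,}", article_url or "")
    return match.group(0) if match else None


def build_comment_url(group_id, count=10, **extra):
    """Build the comment API URL; extra captured parameters go in **extra."""
    params = {"group_id": group_id, "count": count}
    params.update(extra)
    return f"{COMMENT_ENDPOINT}?{urlencode(params)}"
```

Keeping the query parameters in a dictionary makes it easy to add whatever extra fields turn up during packet capture without rewriting the URL string by hand.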

The core code follows:

import requests
import json
from fake_useragent import UserAgent

# Uses the 16yun (亿牛云) crawler proxy service, www.16yun.cn
# "16YUN" and "16IP" are placeholder credentials; replace them with your own.
proxies = {
    "http": "http://16YUN:16IP@proxy.16yun.cn:31000",
    "https": "http://16YUN:16IP@proxy.16yun.cn:31000"
}

# Spoofed request headers that mimic a browser or mobile client
headers = {
    "User-Agent": UserAgent().random,
    "Cookie": "tt_webid=1234567890abcdef;",  # placeholder cookie value
    "Referer": "https://www.toutiao.com/"
}

# Fetch the main news list (H5 endpoint)
def get_news_brief():
    url = "https://www.toutiao.com/hot-event/hot-board/"
    try:
        res = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        res.raise_for_status()  # treat HTTP error statuses as failures
        items = res.json().get("data") or []
        # Keep only the fields needed downstream
        news = []
        for entry in items:
            news.append({
                "title": entry.get("Title"),
                "abstract": entry.get("Desc"),
                "url": entry.get("Url")
            })
        return news
    except Exception as e:
        print("Failed to fetch the news list:", e)
        return []

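# Build the dynamic comment endpoint (group_id etc. must be found via packet capture)
# Note: news_url is not used yet. Unless a real group_id is passed in,
# every article is queried with the same example ID below.
def get_comments_simulated(news_url, group_id=None):
    group_id = group_id or "7130455657910928926"  # example parameter
    comment_api = "https://www.toutiao.com/article/v2/tab_comments/"
    params = {"group_id": group_id, "count": 10}
    try:
        res = requests.get(comment_api, params=params, headers=headers, proxies=proxies, timeout=10)
        res.raise_for_status()  # treat HTTP error statuses as failures
        comment_data = res.json()
        return [c.get("text") for c in comment_data.get("data", {}).get("comments", [])]
    except Exception as e:
        print("Failed to fetch comments:", e)
        return []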

# Tie the steps together: collect the data and save it
def run_scraper():
    result = []
    newslist = get_news_brief()
    print(f"Fetched {len(newslist)} news items")

    for news in newslist:
        comments = get_comments_simulated(news["url"])
        news["comments"] = comments
        result.append(news)

    with open("toutiao_data.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    print("Data written to file: toutiao_data.json")

if __name__ == "__main__":
    run_scraper()
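
The loop in `run_scraper` issues its requests back to back, and as noted earlier, hitting the same endpoint too often can trigger rate limiting. One common mitigation, not part of the original script, is to retry failed calls with exponential backoff and a little random jitter. The sketch below is a generic helper under that assumption; `fetch_with_backoff` and its parameters are hypothetical names, and it works with any zero-argument callable.

```python
import random
import time


def fetch_with_backoff(fetch, retries=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the retries are used up.

    The wait before retry number k is base_delay * 2**k plus a random
    jitter of up to base_delay / 2 seconds. The last error is re-raised
    once all retries have failed.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2)
            sleep(delay)
```

Because `get_news_brief` and `get_comments_simulated` catch their own exceptions and return an empty list, they would need to let errors propagate (or be wrapped at the `requests.get` level) for a helper like this to have any effect.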