How do you accurately collect JD.com reviews with a crawler? Here is our sample collection approach

I. Sample Output

jd1.png

II. Sample Results

jd2.png

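III. About the Example

Have you ever wondered what shoppers are actually saying about the thousands of products on JD.com? For this example we built a systematic crawler and collected multi-dimensional review data for 975 best-selling products on JD, obtaining 8,546 valid reviews in total. The technical approach and the hands-on process are described below.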

Get the test URL for jd.item_review

III. Code Implementation for Accurate Collection

1. Parsing the review list page

import json
import re

from bs4 import BeautifulSoup

# JSON assigned to window.__INITIAL_STATE__ inside the script tag (dots escaped)
STATE_RE = re.compile(r"window\.__INITIAL_STATE__\s*=\s*(.*?);\s*</script>", re.S)


def parse_comment_list(html):
    """Parse one review list page into (comments, total_comments, has_next)."""
    soup = BeautifulSoup(html, "html.parser")
    script_tag = soup.find("script", id="J-product评论-列表")
    match = STATE_RE.search(str(script_tag)) if script_tag else None

    if not match:
        # Return three values here too, so callers can always unpack the result
        return None, 0, False

    # Extract the JSON data (JD stores the review data in a JS variable)
    data = json.loads(match.group(1))

    comments = []
    for item in data["comments"]:
        comments.append({
            "comment_id": item["id"],
            "content": item["content"],
            "score": item["score"],
            "user_name": item["userName"],
            "creation_time": item["creationTime"],
            "useful_votes": item["usefulVoteCount"],
            "reply_count": item["replyCount"],
            "images": [img["imgUrl"] for img in item.get("images", [])],
            "user_level": item["userLevelName"],
            "product_model": item.get("productColor", "") + " " + item.get("productSize", "")
        })

    total_comments = data["productCommentSummary"]["commentCount"]
    # More pages remain while the current page number is below the page total
    has_next = data["page"]["pageNo"] < data["page"]["pageTotal"]

    return comments, total_comments, has_next

2. Deep collection loop (with pagination)

Result Object:
---------------------------------------
{
	"items": {
		"totalpage": "100",
		"total_results": 20000,
		"page_size": 10,
		"page": "1",
		"item": [
			{
				"rate_id": "21992238159",
				"rate_content": "物流和产品都不错,性价比高,赞赞赞👏👏👏质量非常好,客服态度非常非常赞,有问题及时给解决了购物体验很棒,商品物美价廉,质量优秀。物流迅速,商家服务贴心,售后无忧。高颜值,高品质,非常好,一分钱一分货,材质外观和质量一看就很上档次,非常喜欢",
				"rate_date": "2024-12-23 13:49:08",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/268223/27/1816/23111/6768f9d3F79259578/1f946da747fb3842.jpg.dpg",
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/267912/12/1941/21072/6768f9d3F8603a860/4778a28cfc8bb02d.jpg.dpg"
				],
				"display_user_nick": "唐***月",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "21940879411",
				"rate_content": "这条充电线质量非常好,线材柔软,使用寿命长。充电速度快,兼容性强,适用于多种设备。外观设计简洁大方,白色外观显得干净整洁。而且价格合理,性价比很高。使用了一段时间,没有出现任何质量问题,非常满意。推荐给需要充电线的朋友们,绝对物超所值!",
				"rate_date": "2024-12-13 23:39:55",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/228791/35/34183/56412/675c553fFa70cd813/2e1022f9a25e945a.jpg.dpg",
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/195256/23/50656/47240/675c5541F79f5af5e/e4d60651ed0c626f.jpg.dpg"
				],
				"display_user_nick": "j***j",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "21971245211",
				"rate_content": "快递很快,质量棒极了,建议购买强烈推荐!商品物超所值,质量可靠。物流快,商家服务热情,售后服务完善。物流很快,👍产品很快就收到了,比想象中还好,不错不错!希望能耐用商品质量非常好,外观设计新颖,物流速度快,商家服务态度好,性价比高。",
				"rate_date": "2024-12-20 06:09:37",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/262487/32/530/222631/67649996F59b29dbb/41daef4d63774912.jpg.dpg",
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/254171/36/1720/29817/6764999eF1e3da91c/49f9e4e3c0f7b489.jpg.dpg"
				],
				"display_user_nick": "j***b",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "22505131588",
				"rate_content": "这款充电线质量真心不错!💪 用了两年,依然如新,充电速度也很快,完全满足日常需求。非常满意的一次购物体验!👍",
				"rate_date": "2025-03-05 20:55:47",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/261336/2/28436/316471/67c849d2Fd7ca649e/834ef8563c36f823.jpg.dpg"
				],
				"display_user_nick": "郑***c",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "21921806331",
				"rate_content": "真的超级喜欢,非常支持,质量非常好,与卖家描述的完全一致,非常满意,真的很喜欢,完全超出期望值,发货速度非常快,包装非常仔细、严实,物流公司服务态度很好,运送速度很快,很满意的一次购物",
				"rate_date": "2024-12-10 14:29:28",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/151256/8/50806/62364/6757dfc7Fb0e0d3d1/cfcfcea8f15baeb6.jpg.dpg"
				],
				"display_user_nick": "j***6",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "22023511937",
				"rate_content": "这个商品的质量真是太好了,用起来非常顺手,效果也很满意。外观精美,不仅提升了使用体验,还为家居增添了美感。价格虽然高了一些,但相比其优良的品质和体验,绝对是物超所值。强烈推荐给追求品质生活的你!",
				"rate_date": "2024-12-29 08:22:00",
				"pics": [],
				"display_user_nick": "驰***生",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "23041323552",
				"rate_content": "冲电器大小适中,冲电非常的快并且不发热。非常不错!",
				"rate_date": "2025-04-21 17:06:36",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/283321/39/23600/2544182/68060a9bF35b67b24/1ccea6f930ef3986.jpg.dpg",
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/271623/13/24033/2449752/68060a9bFf4aa3b47/184ce7b5a88a25f4.jpg.dpg"
				],
				"display_user_nick": "j***c",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "22823734129",
				"rate_content": "很好的充电套装,线足够长,充电也够快,非常满意。",
				"rate_date": "2025-04-04 17:55:47",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/281947/15/15093/51962/67efac76Ff463f982/a596114283fc4e3a.jpg.dpg"
				],
				"display_user_nick": "雪***泳",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "22036250935",
				"rate_content": "东西质量非常好,与卖家描述的完全一致,非常满意\n做工质感:好\n充电速度:好\n便携性能:好\n安全性能:好\n其他特色:好",
				"rate_date": "2024-12-31 12:06:34",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/262528/8/5909/64887/67736dcaF7461323d/7565029f66830601.jpg.dpg"
				],
				"display_user_nick": "j***a",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			},
			{
				"rate_id": "22666913189",
				"rate_content": "非常不错,质感很好,充电快",
				"rate_date": "2025-03-22 17:55:18",
				"pics": [
					"http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/276088/1/7952/71045/67de8905Fecb7c861/5bf2b9f89b0b7897.jpg.dpg"
				],
				"display_user_nick": "o***g",
				"videos": [],
				"auction_sku": null,
				"add_feedback": null
			}
		],
		"_ddf": "fb"
	},
	"secache": "5dc2b1edf5008bcf6577411b1f5fbd16",
	"secache_time": 1749537314,
	"secache_date": "2025-06-10 14:35:14",
	"translate_status": "",
	"translate_time": 0,
	"language": {
		"default_lang": "cn",
		"current_lang": "cn"
	},
	"error": "",
	"reason": "",
	"error_code": "0000",
	"cache": 0,
	"api_info": "today:71 max:10000 all[374=71+49+254];expires:2030-10-30",
	"execution_time": "4.646",
	"server_time": "Beijing/2025-06-10 14:35:14",
	"client_ip": "106.6.46.187",
	"call_args": {
		"num_iid": "10114820943599",
		"data": "1"
	},
	"api_type": "jd",
	"translate_language": "zh-CN",
	"translate_engine": "google_new",
	"server_memory": "3.33MB",
	"request_id": "gw-3.6847d21e3ae75",
	"last_id": "4513851984"
}
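
The result object above carries `page`, `totalpage`, and `item`, which is enough to drive a paging loop. The following is a minimal sketch under stated assumptions, not the original collection code: `fetch_page` is a hypothetical callable standing in for whatever performs the request (it takes a product ID and a page number and returns the decoded result object), and the page cap and delay are illustrative values. Treating `error_code` `"0000"` as success is also an assumption, based only on the sample result.

```python
import time
from typing import Callable, Dict, Iterator, Optional


def iter_reviews(
    fetch_page: Callable[[str, int], Dict],  # hypothetical: (num_iid, page) -> result object
    num_iid: str,
    max_pages: Optional[int] = None,  # assumed safety cap, not part of the API
    delay_seconds: float = 1.0,       # assumed pause between requests
) -> Iterator[Dict]:
    """Yield review items page by page until the last page is reached."""
    page = 1
    while True:
        result = fetch_page(num_iid, page)
        # Assumption: "0000" means success, as in the sample result
        if result.get("error_code") != "0000":
            break
        items = result.get("items", {})
        batch = items.get("item") or []
        if not batch:
            break
        yield from batch
        # "totalpage" is a string in the sample result, so convert it
        total_pages = int(items.get("totalpage", page))
        if page >= total_pages or (max_pages is not None and page >= max_pages):
            break
        page += 1
        time.sleep(delay_seconds)
```

Because the function is a generator, a caller can stop early (for example after a fixed number of reviews) without requesting the remaining pages.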

IV. Performance Optimization Suggestions

  1. Distributed crawler architecture

    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │  Scheduler  │    │ Crawl nodes │    │ Data store  │
    │   (Redis)   │←──→│  (Scrapy)   │←──→│  (MongoDB)  │
    └─────────────┘    └─────────────┘    └─────────────┘
           ↑                  ↑                  ↑
           ├──────────────────┼──────────────────┤
           │           ┌──────┴──────┐           │
           └──────────→│ Proxy pool  │←──────────┘
                       └──────┬──────┘
                       ┌──────┴──────┐
                       │  Cleaning   │
                       └─────────────┘
    
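  2. Incremental collection
    Record the last collection time and the collected review IDs in Redis, and fetch only reviews that are new since then, which cuts down on repeated requests.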
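
The incremental idea can be sketched as follows. This is an illustration only: `SeenStore` is a hypothetical in-memory stand-in for Redis so that the example runs on its own, and in a real deployment the ID set and the timestamp would be kept in Redis instead. The field names `rate_id` and `rate_date` are taken from the sample result object.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Set


class SeenStore:
    """Hypothetical in-memory stand-in for Redis: seen review IDs and newest review time."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.last_time: Optional[datetime] = None


def filter_new_reviews(reviews: Iterable[Dict], store: SeenStore) -> List[Dict]:
    """Return only reviews not seen before, updating the store as a side effect."""
    fresh = []
    for review in reviews:
        rate_id = review["rate_id"]
        if rate_id in store.seen_ids:
            continue  # already collected in an earlier run
        store.seen_ids.add(rate_id)
        created = datetime.strptime(review["rate_date"], "%Y-%m-%d %H:%M:%S")
        # Track the newest review time seen so far
        if store.last_time is None or created > store.last_time:
            store.last_time = created
        fresh.append(review)
    return fresh
```

The sketch deduplicates by ID rather than cutting off by time alone, because the items in the sample result are not ordered by `rate_date`.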

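V. Caveats

  1. Dealing with JD anti-crawling upgrades

    • Regularly check for page structure changes (for example, review data moving from a JS variable to a JSON endpoint)
    • Use Selenium with undetected-chromedriver to get past the latest anti-crawling detection
  2. Code maintenance cost
    Crawler code has to be adapted frequently to JD page updates, so pairing it with an automation tool such as Playwright is recommended to improve robustness.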
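
One simple way to notice a structure change early is to check every collected record against the fields the downstream code expects. The sketch below is an assumption-laden illustration: `EXPECTED_FIELDS` and `find_structure_drift` are hypothetical names, and the field set is copied from the item keys in the sample result object rather than from any official schema.

```python
from typing import Dict, Iterable, List, Set

# Keys that every item carries in the sample result; adjust if the source changes
EXPECTED_FIELDS: Set[str] = {
    "rate_id",
    "rate_content",
    "rate_date",
    "pics",
    "display_user_nick",
}


def find_structure_drift(
    items: Iterable[Dict], expected: Set[str] = EXPECTED_FIELDS
) -> List[str]:
    """Return a sorted list of expected fields missing from at least one item."""
    missing: Set[str] = set()
    for item in items:
        missing.update(expected.difference(item.keys()))
    return sorted(missing)
```

A non-empty return value is a signal to inspect the page or response by hand before continuing a collection run.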

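With the approach above, JD reviews can be collected accurately while staying within compliance requirements. Use a crawler only when it is really necessary, and keep the collection volume strictly under control.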