Python Web Scraping Primer (1): Scraping a Forum Post List with requests and BeautifulSoup


Installing Dependencies

requests

pip install requests

BeautifulSoup

pip install beautifulsoup4
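A quick way to confirm the install worked is to parse a small HTML snippet. The markup below is made up for illustration, but it mirrors the kind of list structure parsed later in this post:

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML fragment, purely for testing the parser.
html = '<ul class="for-list"><li><a class="truetit" href="/1.html">hello</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# class_ is how bs4 filters on the (reserved-word) class attribute.
link = soup.find('a', class_='truetit')
print(link['href'], link.text)  # /1.html hello
```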

Requirements

Scrape the first 50 pages of posts from the main street (Bxj) board on Hupu. First, fetch each page's response with requests, then parse the response body, response.text, with BeautifulSoup.
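The board's pagination scheme places page N at `https://bbs.hupu.com/bxj-postdate-N`, so building each page's URL is a simple string join. A small helper (hypothetical, not part of the code below, which inlines the same logic) makes the pattern explicit:

```python
def build_page_url(base: str, page: int) -> str:
    """Return the URL for a 1-indexed page of the board."""
    return f"{base}-{page}"

print(build_page_url("https://bbs.hupu.com/bxj-postdate", 1))
# https://bbs.hupu.com/bxj-postdate-1
```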

Code

import requests
from bs4 import BeautifulSoup as bs
import time

url = "https://bbs.hupu.com/bxj-postdate"
useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
header = {
    'user-agent': useragent,
    # Hupu requires login to view pages from page 11 onward.
    # A cookie copied from the browser will expire, so it has to be
    # copied again before later runs.
    'cookie': 'your cookie'
}

for page in range(50):
    page_url = url + '-' + str(page + 1)
    print(f'------------------ Page {page + 1}: {page_url} ------------------')
    response = requests.get(page_url, headers=header)
    bs_info = bs(response.text, 'html.parser')
    ul = bs_info.find('ul', class_='for-list')
    for li in ul.find_all('li'):
        # Title
        title_div = li.find('div', class_='titlelink box')
        a_tag = title_div.find('a', class_='truetit')
        # Author
        author_div = li.find('div', class_='author box')
        author_link = author_div.find('a', class_='aulink')
        # Post date
        pub_date = author_div.find_all('a')[1].text
        print('https://bbs.hupu.com' + a_tag.get('href'), a_tag.text.strip(), author_link.text, pub_date)
    time.sleep(1)

Results

------------------ Page 1: https://bbs.hupu.com/bxj-postdate-1 ------------------
https://bbs.hupu.com/36407012.html 一线城市基本工资很低 呼噗呼噗我来啦 2020-07-07
https://bbs.hupu.com/36407009.html 抽象人是这样参加高考的 虎扑JR0132279583 2020-07-07
...
...
...
Elapsed: 0.19072270393371582
...
...
------------------ Page 50: https://bbs.hupu.com/bxj-postdate-50 ------------------
https://bbs.hupu.com/36390518.html 平板看斗鱼直播很卡怎么办 听风看雨卧亭中 2020-07-06
https://bbs.hupu.com/36390517.html 男生夏天洗澡不到十分钟不是很正常吗? 胡飞飞1013 2020-07-06
...
...
...
Elapsed: 0.25310277938842773
