Python Web Scraping: Topic-Focused Crawling of Sohu News (Steps and Code)


1. Approach

This walkthrough crawls the current-affairs (politics) section of Sohu News.

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/7df50830cad94be5aee009e5c547974a)

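Get the URL → scrape the headlines and their links → judge how well each headline matches the topic → produce the final result

2. Finding the URL pattern

Inspection shows that the Sohu News page is loaded dynamically.
However, F12 > Network > XHR lists no requests, so the data cannot be found there.
Under All, there is a request whose response contains the content we are looking for.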

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/8af86ec85e1f4f75b2034e062846995e)

That file turns out to be a js file.

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/2cf9f83075834cbaab2b1027ab0a8782)

Compare the URLs of the four requests whose names start with feed:

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/974da01dca4a49dcb5e873da0efcd4fd)

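Three things vary: page changes with the page number, callback changes with no apparent pattern, and the trailing number increases by 8 on each page. Removing callback has no effect on the returned content.
So the final page URLs are built by string concatenation: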

for p in range(1,10):
    p2=1603263206992+p*8 # the trailing number grows by 8 per page
    url='https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page='+str(p)+'&size=20&_='+str(p2)
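The same URLs can also be built with `urllib.parse.urlencode`, which keeps the query parameters readable. The following is only a minimal sketch based on the parameters observed above; the helper name `build_feed_url` and the constant names are illustrative and not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://v2.sohu.com/public-api/feed"
BASE_STAMP = 1603263206992  # trailing number taken from the captured requests


def build_feed_url(page, size=20):
    """Build the feed URL for one page (hypothetical helper)."""
    params = {
        "scene": "CATEGORY",
        "sceneId": 1460,
        "page": page,
        "size": size,
        "_": BASE_STAMP + page * 8,  # increases by 8 per page, as observed
    }
    return BASE_URL + "?" + urlencode(params)


urls = [build_feed_url(p) for p in range(1, 10)]
print(urls[0])
```

This produces the same strings as the concatenation version, so either form can be used inside the loop.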

3. Scraping the headlines and their links

Regular expressions are used for extraction this time.

Code:

headers={
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
               'cookie':'itssohu=true; BAIDU_SSP_lcr=https://news.hao123.com/wangzhi; IPLOC=CN3300; SUV=201021142102FD7T; reqtype=pc; gidinf=x099980109ee124d51195e802000a3aab2e8ca7bf7da; t=1603261548713; jv=78160d8250d5ed3e3248758eeacbc62e-kuzhE2gk1603261903982; ppinf=2|1603261904|1604471504|bG9naW5pZDowOnx1c2VyaWQ6Mjg6MTMxODgwMjEyODc2ODQzODI3MkBzb2h1LmNvbXxzZXJ2aWNldXNlOjMwOjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMHxjcnQ6MTA6MjAyMC0xMC0yMXxlbXQ6MTowfGFwcGlkOjY6MTE2MDA1fHRydXN0OjE6MXxwYXJ0bmVyaWQ6MTowfHJlbGF0aW9uOjA6fHV1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1bmlxbmFtZTowOnw; pprdig=L2Psu-NwDR2a1BZITLwhlxdvI2OrHzl6jqQlF3zP4z70gqsyYxXmf5dCZGuhPFZ-XWWE5mflwnCHURGUQaB5cxxf8HKpzVIbqTJJ3_TNhPgpDMMQdFo64Cqoay43UxanOZJc4-9dcAE6GU3PIufRjmHw_LApBXLN7sOMUodmfYE; ppmdig=1603261913000000cfdc2813caf37424544d67b1ffee4770'
                }
        res=requests.get(url,headers=headers)
        soup=BeautifulSoup(res.text,'lxml')
        # pull the title and link fields out of the response text with regexes
        news=re.findall(r'"mobileTitle":"(.*?)",',str(soup))
        herf=re.findall(r'"originalSource":"(.*?)"',str(soup))
        #news=soup.find_all("div",attrs={'class':'news-wrapper'})
        #html=etree.HTML(res.text)
        #news=html.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/div/div[3]/div[3]/h4/a/text()')
        news_dic=dict(zip(news,herf)) # store titles and links in a dict
        for k,v in news_dic.items():
            news_dictall[k]=v # merge this page's dict into the overall dict
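To see what the two patterns do without hitting the network, they can be run against a small string. This is a sketch only: the sample text below is made up to mimic the two fields the regexes target and is not real Sohu data, and `extract_pairs` is a hypothetical helper name.

```python
import re

# Made-up sample shaped like the fields the regexes target; not real Sohu data.
sample = (
    '[{"mobileTitle":"Title A","originalSource":"https://example.com/a"},'
    '{"mobileTitle":"Title B","originalSource":"https://example.com/b"}]'
)


def extract_pairs(text):
    """Return {title: link} using the same two patterns as the snippet above."""
    titles = re.findall(r'"mobileTitle":"(.*?)",', text)
    links = re.findall(r'"originalSource":"(.*?)"', text)
    return dict(zip(titles, links))


all_news = {}
all_news.update(extract_pairs(sample))  # same effect as copying key by key
print(all_news)
```

Note that `zip` pairs titles and links purely by position, so if a record lacks one of the two fields the remaining pairs on that page would be misaligned.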

4. Judging relevance to the topic

def ifsim(topicwords):
    news_dicfin={}
    news_dic=getdata()
    # load the stop-word list (raw string so the backslashes are taken literally)
    ana.set_stop_words(r'D:\作业\python\文本挖掘\数据集\新闻数据集\data\stopwords.txt')

    for k,v in news_dic.items():
        word_list=ana.extract_tags(k,topK=50,withWeight=False) # TF-IDF keyword extraction, stop words removed
        #word_lil.append(word_list)
        word_lil=[]
        for i in word_list:
            word_lil.append([i]) # wrap each word in its own list so it can be passed to Dictionary
        word_dic=Dictionary(word_lil) # build a gensim Dictionary for analysis
        d=dict(word_dic.items())
        docwords=set(d.values())
        # relevance check
        commwords=topicwords.intersection(docwords) # intersection with the topic words
        if len(commwords)>0: # non-empty intersection: store in the final dict
            news_dicfin[k]=v
    print(news_dicfin)
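The core of the check is a set intersection, which can be isolated from the crawling and the jieba/gensim steps. The sketch below makes that explicit under stated assumptions: `filter_by_topic` and `split_words` are hypothetical names, the headlines are invented, and a whitespace split stands in for jieba's `extract_tags`.

```python
def filter_by_topic(news, topicwords, extract_keywords):
    """Keep headlines whose keywords share at least one word with topicwords."""
    kept = {}
    for title, link in news.items():
        docwords = set(extract_keywords(title))
        if topicwords & docwords:  # non-empty intersection means on-topic
            kept[title] = link
    return kept


def split_words(title):
    """Stand-in keyword extractor: lowercase whitespace split."""
    return title.lower().split()


# Invented headlines for illustration only.
news = {
    "New covid cases reported": "https://example.com/1",
    "Local football results": "https://example.com/2",
}
topic = {"covid", "cases"}
print(filter_by_topic(news, topic, split_words))
```

Passing the extractor in as a parameter keeps the filter testable without a stop-word file; in the article's setting the extractor would be the `ana.extract_tags` call.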

Printing word_dic directly gives:

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/14dbb5fbb39f4cc8a3c3109922cc494d)

The output of docwords is:

![](https://p26-tt.byteimg.com/origin/pgc-image/8ce9e59f8e3143db9281bc1a2b632a07)

The output of word_list is:

![](https://p1.pstatp.com/origin/pgc-image/a16155315e0e456ca747bede710d11b2)

The output of word_lil is:

![](https://p1.pstatp.com/origin/pgc-image/165d438aa3244e9c9e2773fe6b808a6a)

The output of d is:

![](https://p1.pstatp.com/origin/pgc-image/009e2464b9da46f6bfc451dce40f19c5)
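The shapes of these intermediate values can be reproduced in plain Python. This is an illustrative sketch only: the three keywords are hypothetical, and an ordinary dict stands in for gensim's `Dictionary`, which may assign the integer ids in a different order.

```python
word_list = ["疫情", "病例", "确诊"]      # hypothetical keywords for one headline
word_lil = [[w] for w in word_list]      # list of single-word lists

# Plain-dict stand-in for dict(Dictionary(word_lil).items()): integer id -> word.
# gensim may assign the ids in a different order.
d = {i: doc[0] for i, doc in enumerate(word_lil)}
docwords = set(d.values())

print(word_lil)
print(d)
print(docwords == set(word_list))
```

Because docwords ends up holding exactly the words in word_list, the intersection with the topic words depends only on the extracted keywords.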

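5. Output

A headline is judged to be on-topic when it shares at least one word with the topic words I supplied, that is, when the intersection is non-empty.
Such headlines are stored in the final dictionary.
The output of news_dicfin is: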

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/84ccd599c33d4497aeb83fbf59520aae)

6. Full code

import requests
from bs4 import BeautifulSoup
import jieba
from gensim.corpora.dictionary import Dictionary
import re
import jieba.analyse as ana

def getdata():
    #news_all=[]
    news_dictall={}
    for p in range(1,10):
        p2=1603263206992+p*8 
        url='https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page='+str(p)+'&size=20&_='+str(p2)
        headers={
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
               'cookie':'itssohu=true; BAIDU_SSP_lcr=https://news.hao123.com/wangzhi; IPLOC=CN3300; SUV=201021142102FD7T; reqtype=pc; gidinf=x099980109ee124d51195e802000a3aab2e8ca7bf7da; t=1603261548713; jv=78160d8250d5ed3e3248758eeacbc62e-kuzhE2gk1603261903982; ppinf=2|1603261904|1604471504|bG9naW5pZDowOnx1c2VyaWQ6Mjg6MTMxODgwMjEyODc2ODQzODI3MkBzb2h1LmNvbXxzZXJ2aWNldXNlOjMwOjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMHxjcnQ6MTA6MjAyMC0xMC0yMXxlbXQ6MTowfGFwcGlkOjY6MTE2MDA1fHRydXN0OjE6MXxwYXJ0bmVyaWQ6MTowfHJlbGF0aW9uOjA6fHV1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1bmlxbmFtZTowOnw; pprdig=L2Psu-NwDR2a1BZITLwhlxdvI2OrHzl6jqQlF3zP4z70gqsyYxXmf5dCZGuhPFZ-XWWE5mflwnCHURGUQaB5cxxf8HKpzVIbqTJJ3_TNhPgpDMMQdFo64Cqoay43UxanOZJc4-9dcAE6GU3PIufRjmHw_LApBXLN7sOMUodmfYE; ppmdig=1603261913000000cfdc2813caf37424544d67b1ffee4770'
                }
        res=requests.get(url,headers=headers)
        soup=BeautifulSoup(res.text,'lxml')
        # pull the title and link fields out of the response text with regexes
        news=re.findall(r'"mobileTitle":"(.*?)",',str(soup))
        herf=re.findall(r'"originalSource":"(.*?)"',str(soup))
        #news=soup.find_all("div",attrs={'class':'news-wrapper'})
        #html=etree.HTML(res.text)
        #news=html.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/div/div[3]/div[3]/h4/a/text()')
        news_dic=dict(zip(news,herf)) # store titles and links in a dict
        for k,v in news_dic.items():
            news_dictall[k]=v # merge this page's dict into the overall dict
    return news_dictall # return the combined dict

def ifsim(topicwords):
    news_dicfin={}
    news_dic=getdata()
    # load the stop-word list (raw string so the backslashes are taken literally)
    ana.set_stop_words(r'D:\作业\python\文本挖掘\数据集\新闻数据集\data\stopwords.txt')

    for k,v in news_dic.items():
        word_list=ana.extract_tags(k,topK=50,withWeight=False) # TF-IDF keyword extraction, stop words removed
        #word_lil.append(word_list)
        word_lil=[]
        for i in word_list:
            word_lil.append([i]) # wrap each word in its own list so it can be passed to Dictionary
        word_dic=Dictionary(word_lil) # build a gensim Dictionary for analysis
        d=dict(word_dic.items())
        docwords=set(d.values())
        # relevance check
        commwords=topicwords.intersection(docwords) # intersection with the topic words
        if len(commwords)>0: # non-empty intersection: store in the final dict
            news_dicfin[k]=v
    print(news_dicfin)

if __name__=='__main__':
    # topic words: epidemic, COVID, pneumonia, confirmed, cases
    topicwords={"疫情","新冠","肺炎","确诊","病例"}
    ifsim(topicwords)
