The past two weeks of reading up on crisis PR made me realize just how much damage public opinion can do to individuals and organizations. In today's hyper-connected internet era it matters all the more that we speak civilly, act civilly, and be civil people. So today let's dig into the 有据 (chinafactcheck.com) site and see which reports are real news and which are fake.
- Without further ado, let's start as usual by analyzing the site structure. Site address: chinafactcheck.com
The data list page is right here. First, flip through a page and look at the data interface.
The data is clearly in the HTML, yet the endpoint looks odd: it resembles a JSON interface, but it's a GET request and the data still comes back as HTML. Strange. Never mind that for now; next let's see what the detail-page data looks like.
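(A quick aside: one low-tech way to settle the "server-rendered or AJAX?" question is to check whether the target markup already appears in the raw response body, before any JavaScript runs. This is only a minimal sketch; the helper name and the sample HTML are mine, and `post-item` is simply the list-item class seen in the page source:)

```python
def looks_server_rendered(raw_html: str, marker: str) -> bool:
    """True if the target markup is already present in the raw response body.

    If the marker only appears after JavaScript runs, it will be missing here,
    which points to AJAX/browser-side rendering instead.
    """
    return marker in raw_html

# Stand-in response bodies for the two cases:
rendered = '<div class="post-item"><a href="/?p=1">headline</a></div>'
empty_shell = '<div id="app"></div><script src="bundle.js"></script>'
print(looks_server_rendered(rendered, 'post-item'))     # True
print(looks_server_rendered(empty_shell, 'post-item'))  # False
```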
The request looks just like the list page's: it could be AJAX-rendered, or it could be plain browser-rendered. OK then, let's just write some code and find out.
2. First request the list page and parse out the article IDs:
```python
import requests
from lxml import etree

headers = {
    'authority': 'chinafactcheck.com',
    'accept': '*/*',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    # 'content-length': '0',
    'origin': 'https://chinafactcheck.com',
    'pragma': 'no-cache',
    'referer': 'https://chinafactcheck.com/',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
params = {
    'paged': '4',  # page number
}
response = requests.post('https://chinafactcheck.com/', params=params, headers=headers)
res = etree.HTML(response.text)
data_list = res.xpath("//div[@class='post-item']/div[@class='post-info-box']/div[@class='post-info']/a")
for data in data_list:
    url = data.xpath("./@href")[0]        # article link (xpath returns a list, take the first hit)
    title = data.xpath("./h2/text()")[0]  # article title
    print(title, url)
```
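Before trusting that XPath on live data, the selector chain can be sanity-checked offline against a hand-written snippet that mirrors the markup it assumes (the sample HTML and the values in it below are made up, not taken from the site):

```python
from lxml import etree

# A minimal snippet mirroring the nesting the XPath above assumes:
# post-item > post-info-box > post-info > a > h2
sample = '''
<div class="post-item">
  <div class="post-info-box">
    <div class="post-info">
      <a href="https://chinafactcheck.com/?p=123"><h2>Sample headline</h2></a>
    </div>
  </div>
</div>
'''
res = etree.HTML(sample)
for a in res.xpath("//div[@class='post-item']/div[@class='post-info-box']/div[@class='post-info']/a"):
    print(a.xpath("./@href")[0], a.xpath("./h2/text()")[0])
```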
OK, no problems there. Next, request the detail page and take a look:
```python
import requests
from lxml import etree

headers = {
    'authority': 'chinafactcheck.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}
# `url` is one of the article links extracted from the list page above
response = requests.get(url, headers=headers)
res = etree.HTML(response.text)
content = ''
content_list = res.xpath("//div[@class='content-body']//text()")
for con in content_list:
    content += con
print(content)
```
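The `content += con` accumulation loop works, but `str.join` over the text nodes is the more idiomatic form, and it too can be exercised offline (the sample detail-page HTML below is mine, not the site's real markup):

```python
from lxml import etree

# Stand-in for a detail page: article text lives under div.content-body
sample = '''
<div class="content-body">
  <p>First paragraph of the fact check.</p>
  <p>Second paragraph.</p>
</div>
'''
res = etree.HTML(sample)
# One join over all descendant text nodes replaces the manual loop
content = ''.join(res.xpath("//div[@class='content-body']//text()"))
print(content)
```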
No problems there either.