How to Read and Test robots.txt with Python

In this quick tutorial, we'll cover how to test, read, and extract information from robots.txt with Python. We'll use two libraries: urllib.request and requests.

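Step 1: Test whether robots.txt exists

First, we'll test whether robots.txt exists. For this we'll use the requests library: we request the robots.txt URL and return the HTTP status code of the response: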

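import requests

def status_code(url):
    # Request the page and return only the HTTP status code.
    # The timeout keeps the call from hanging on an unresponsive server.
    r = requests.get(url, timeout=10)
    return r.status_code

print(status_code('https://softhints.com/robots.txt'))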

A result of 200 means that robots.txt exists for this site: https://softhints.com/.
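Status codes other than 200 can be worth telling apart. The following is a minimal sketch of one way to interpret the number returned by status_code; the helper name describe_status and the wording of its verdicts are illustrative assumptions, not part of the original tutorial.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable verdict for a robots.txt status code."""
    # HTTPStatus raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase  # e.g. 200 -> "OK", 404 -> "Not Found"
    if code == HTTPStatus.OK:
        verdict = "robots.txt exists"
    elif code == HTTPStatus.NOT_FOUND:
        verdict = "no robots.txt at this URL"
    elif code in (HTTPStatus.FORBIDDEN, HTTPStatus.UNAUTHORIZED):
        verdict = "access refused, the file may still exist"
    else:
        verdict = "inconclusive"
    return f"{code} {phrase}: {verdict}"

print(describe_status(200))
print(describe_status(404))
```

A 403 in particular says more about how the server treats the client than about whether the file is there.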

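Step 2: Read robots.txt with Python

Now suppose we want to extract specific information from the robots.txt file, namely the sitemap links.

To read and parse robots.txt with Python, we'll use urllib.request.

The code to read and parse the robots.txt file looks like this: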


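from urllib.request import urlopen
import re

robots = 'https://softhints.com/robots.txt'

sitemap_ls = []

# Download the file once and scan it line by line
with urlopen(robots) as stream:
    for line in stream.read().decode("utf-8").split('\n'):
        if 'sitemap' in line.lower():
            # Keep the URL that follows the "Sitemap:" prefix, if there is one
            found = re.findall(r' (https.*xml)', line)
            if found:
                sitemap_ls.append(found[0])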

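If the code works, sitemap_ls will contain the sitemap links. If the site has bot protection, you'll get a 403 error instead:

HTTPError: HTTP Error 403: Forbidden

Depending on the protection, you may need different techniques to work around it.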
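One thing that is sometimes enough is sending an explicit User-Agent header, since urllib otherwise identifies itself with a default Python-urllib value. The sketch below assumes that is the cause of the 403, which may not hold for a given site; the names USER_AGENT, build_request, and read_robots are hypothetical.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Hypothetical identifier; replace it with one that describes your own client
USER_AGENT = "my-robots-checker/0.1"

def build_request(url):
    """Create a request that announces an explicit User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def read_robots(url):
    """Return the robots.txt text, or None if the server answers with an HTTP error."""
    try:
        with urlopen(build_request(url), timeout=10) as stream:
            return stream.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # Covers 403, 404 and other HTTP error responses
        print(f"Request failed: {err.code} {err.reason}")
        return None
```

Calling read_robots('https://softhints.com/robots.txt') would then return the file's text, or None if the server still refuses the request.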

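Step 3: Extract sitemap links from any URL

In this last section, we'll define a function that takes a URL and tries to extract the sitemap.xml links from that site's robots.txt file.

The code reuses the steps above and adds one more: urlparse splits the URL so we can keep only its scheme and domain.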
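The urlparse step can be tried on its own. This is a minimal sketch; the helper name robots_url and the page path in the example call are made up for illustration.

```python
from urllib.parse import urlparse

def robots_url(page_url):
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlparse(page_url)
    # netloc is the host (plus port, if any); path and query are dropped
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_url("https://blog.softhints.com/some/page?x=1"))
# https://blog.softhints.com/robots.txt
```

The full function that follows combines this step with the download and parsing from step 2.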

from urllib.request import urlopen
from urllib.parse import urlparse
import re

test_url = "https://blog.softhints.com"

def get_robots(test_url):
    # Build the robots.txt URL from the scheme and domain of the input URL
    parsed = urlparse(test_url)
    domain = parsed.netloc
    scheme = parsed.scheme
    robots = f'{scheme}://{domain}/robots.txt'

    sitemap_ls = []

    # Download the file once and collect every sitemap URL it lists
    with urlopen(robots) as stream:
        for line in stream.read().decode("utf-8").split('\n'):
            if 'sitemap' in line.lower():
                found = re.findall(r' (https.*xml)', line)
                if found:
                    sitemap_ls.append(found[0])
    # Remove duplicates before returning
    return list(set(sitemap_ls))

print(get_robots(test_url))

If the code works, you'll get the site's sitemap, or a list of sitemaps:

['https://blog.xxx.com/sitemap.xml']
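As an alternative to the regular expression, Python's standard library also ships urllib.robotparser, whose RobotFileParser.site_maps() method (available since Python 3.8) returns the Sitemap entries of a parsed file. This is a different technique from the one used in this tutorial, shown here as a minimal sketch; the sample robots.txt content and the helper name sitemaps_from_text are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
"""

def sitemaps_from_text(robots_text):
    """Return the sitemap URLs listed in robots.txt text (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when the file has no Sitemap lines
    return parser.site_maps() or []

print(sitemaps_from_text(SAMPLE_ROBOTS))
```

Unlike the pattern ' (https.*xml)', this does not require the sitemap URL to start with https or end in xml, so it may pick up entries the regular expression skips.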