Parse Website XML Sitemap with Python and Pandas

Need to parse a website's XML sitemap with Python and Pandas? Want all the URLs as a Pandas DataFrame?

If so, this article walks through several useful solutions.

Option 1: Parse XML Sitemap with Python and Pandas

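The sitemap we are going to read is this site's own: /sitemap.xml

We will download it with the Python library requests and then parse the URLs and dates with the lxml module.

The code is shown below. First we read the sitemap with the requests package. Next we load the content into lxml, which builds a tree of all the elements.

Finally, we iterate over the elements of the sitemap and append one dict per entry to a list: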

import requests
import pandas as pd
from lxml import etree

main_sitemap = 'https://blog.softhints.com/sitemap.xml'

xmlDict = []

r = requests.get(main_sitemap)
root = etree.fromstring(r.content)
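print("The number of sitemap tags is {0}".format(len(root)))
for sitemap in root:
    # list(element) replaces the deprecated getchildren()
    children = list(sitemap)
    # assumes <loc> is the first child and <lastmod> the second, as in this sitemap
    xmlDict.append({'url': children[0].text, 'date': children[1].text})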

pd.DataFrame(xmlDict)

The result is a Pandas DataFrame with two columns, url and date, holding the URLs and dates from the sitemap.
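The loop in Option 1 reads the children by position, which fails when an entry has no lastmod or lists its children in another order. As a hedged alternative, the sketch below looks up each child by its local tag name instead. The inline SAMPLE bytes (with example.com URLs) and the helper name sitemap_to_rows are assumptions made for illustration; in practice SAMPLE would be r.content from requests.

```python
import pandas as pd
from lxml import etree

# Made-up sample standing in for the downloaded r.content.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2021-06-27T20:56:56.533Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
  </sitemap>
</sitemapindex>"""


def sitemap_to_rows(xml_bytes):
    """Hypothetical helper: one dict per entry, children matched by local name."""
    root = etree.fromstring(xml_bytes)
    rows = []
    for entry in root:
        if not isinstance(entry.tag, str):
            continue  # skip comments and processing instructions
        fields = {}
        for child in entry:
            if isinstance(child.tag, str):
                # QName(...).localname drops the "{namespace}" part of the tag
                fields[etree.QName(child).localname] = child.text
        rows.append({"url": fields.get("loc"), "date": fields.get("lastmod")})
    return rows


rows = sitemap_to_rows(SAMPLE)
df = pd.DataFrame(rows)
print(df)
```

An entry without a lastmod simply gets None in the date column rather than raising an IndexError.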

Option 2: Parse a compressed XML Sitemap with Python

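In this option we deal with a sitemap that is gzip-compressed. Parsing is very similar, but it includes one extra step, extraction:

import requests
import gzip
from io import BytesIO

r = requests.get('http://blog.softhints.com/sitemap.xml.gz')
# r.content is bytes, so it has to be wrapped in BytesIO (StringIO only accepts str)
sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()

We use BytesIO from io to expose the downloaded bytes as a file-like object, so that gzip can read the content of the compressed XML sitemap.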

After that, we can parse it in the same way as in Option 1.
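The extraction step can be tried without any network access. The sketch below is only an illustration: it compresses a made-up sitemap in memory (the SAMPLE bytes and example.com URL are assumptions) so that compressed plays the role of r.content, then decompresses and parses it with the standard library.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Made-up sitemap; gzip.compress() stands in for the downloaded .gz payload.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2021-08-16T09:50:34.531Z</lastmod>
  </sitemap>
</sitemapindex>"""
compressed = gzip.compress(SAMPLE)  # plays the role of r.content

# File-like approach, as in the snippet above: wrap the bytes in BytesIO.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as fh:
    xml_bytes = fh.read()

# One-shot alternative for data that is already fully in memory.
same_bytes = gzip.decompress(compressed)

# Parse as in Option 1: first child is <loc>, second is <lastmod>.
root = ET.fromstring(xml_bytes)
rows = [{"url": entry[0].text, "date": entry[1].text} for entry in root]
print(rows)
```

Both approaches return the same bytes; gzip.decompress is shorter, while GzipFile is useful when the data comes from a stream.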

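Option 3: Parse a local XML Sitemap with Python - no namespace

Sometimes you need to use Python to parse a local sitemap stored on your machine.

Suppose you have a local file named XML Sitemap.xml with content similar to this: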

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//blog.softhints.com/sitemap.xsl"?><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><sitemap><loc>https://blog.softhints.com/sitemap-pages.xml</loc><lastmod>2021-06-27T20:56:56.533Z</lastmod></sitemap><sitemap><loc>https://blog.softhints.com/sitemap-posts.xml</loc><lastmod>2021-08-16T09:50:34.531Z</lastmod></sitemap><sitemap><loc>https://blog.softhints.com/sitemap-authors.xml</loc><lastmod>2021-08-20T19:26:41.214Z</lastmod></sitemap><sitemap><loc>https://blog.softhints.com/sitemap-tags.xml</loc><lastmod>2021-08-16T09:50:34.641Z</lastmod></sitemap></sitemapindex>

To parse the sitemap above while ignoring the namespace, you can use the following Python code:

import lxml.etree


tree = lxml.etree.parse("/home/myuser/Desktop/XML Sitemap.xml")

for url in tree.xpath("//*[local-name()='loc']/text()"):
    print(url)
    
for date in tree.xpath("//*[local-name()='lastmod']/text()"):
    print(date)

This prints all the URLs first, followed by all the dates:

https://blog.softhints.com/sitemap-pages.xml
https://blog.softhints.com/sitemap-posts.xml
https://blog.softhints.com/sitemap-authors.xml
https://blog.softhints.com/sitemap-tags.xml
2021-06-27T20:56:56.533Z
2021-08-16T09:50:34.531Z
2021-08-20T19:26:41.214Z
2021-08-16T09:50:34.641Z

Note: if you need to find all the elements and print their values, you can use the method root.iter():

root = tree.getroot()
for i in root.iter():
    print(i.text)
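The two loops in Option 3 print the URLs and the dates as separate lists. If you would rather keep each URL next to its own date, one possible sketch is to select every entry first and then query inside it, still using local-name() to ignore the namespace. The inline SAMPLE bytes and example.com URLs are assumptions for illustration; with a local file you would use lxml.etree.parse as above.

```python
from lxml import etree

# Made-up sample in place of the local file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2021-06-27T20:56:56.533Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-tags.xml</loc>
    <lastmod>2021-08-16T09:50:34.641Z</lastmod>
  </sitemap>
</sitemapindex>"""

root = etree.fromstring(SAMPLE)

pairs = []
# Select each <sitemap> entry regardless of namespace...
for entry in root.xpath("//*[local-name()='sitemap']"):
    # ...then read its children relative to that entry ("./").
    # XPath string() yields '' when the child is missing.
    loc = entry.xpath("string(./*[local-name()='loc'])")
    lastmod = entry.xpath("string(./*[local-name()='lastmod'])")
    pairs.append((loc, lastmod))

for loc, lastmod in pairs:
    print(loc, lastmod)
```

For a regular sitemap whose entries are url elements rather than sitemap elements, the first XPath would need to match 'url' instead.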

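Option 4: Parse a local XML Sitemap with Python - with namespace

If you need to parse the XML sitemap using its namespace, you can use the method root.findall and pass the namespace as part of the path.

To find out what your namespace is, inspect the root element: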

tree.getroot()

The result is:

<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex' at 0x7fe25fbdd720>

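So the namespace we are going to use is {http://www.sitemaps.org/schemas/sitemap/0.9}, and the element we are searching for is sitemap.

The code below parses an XML sitemap stored locally, but the same approach also works for content fetched with requests: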

import xml.etree.ElementTree as ET
tree = ET.parse("/home/myuser/Desktop/XML Sitemap.xml")
root = tree.getroot()

# In find/findall, prefix namespaced tags with the full namespace in braces
for sitemap in root.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}sitemap'):
    loc = sitemap.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
    lastmod = sitemap.find('{http://www.sitemaps.org/schemas/sitemap/0.9}lastmod').text
    print(loc, lastmod)

The parsed sitemap content looks like this:

https://blog.softhints.com/sitemap-pages.xml 2021-06-27T20:56:56.533Z
https://blog.softhints.com/sitemap-posts.xml 2021-08-16T09:50:34.531Z
https://blog.softhints.com/sitemap-authors.xml 2021-08-20T19:26:41.214Z
https://blog.softhints.com/sitemap-tags.xml 2021-08-16T09:50:34.641Z
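Repeating the full namespace in braces gets verbose. ElementTree's find, findall and findtext also accept a prefix-to-URI mapping, so a shorter variant might look like the sketch below. The prefix sm and the inline SAMPLE bytes (with example.com URLs) are arbitrary choices made for this illustration; ET.fromstring is used here in place of ET.parse, which is also how bytes downloaded with requests could be loaded.

```python
import xml.etree.ElementTree as ET

# Made-up sample; with requests this would be r.content.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-authors.xml</loc>
    <lastmod>2021-08-20T19:26:41.214Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-tags.xml</loc>
    <lastmod>2021-08-16T09:50:34.641Z</lastmod>
  </sitemap>
</sitemapindex>"""

# The prefix "sm" is arbitrary; only the URI has to match the document.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE)

entries = []
for sitemap in root.findall("sm:sitemap", NS):
    # findtext returns the child's text, or None if the child is absent
    loc = sitemap.findtext("sm:loc", namespaces=NS)
    lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
    entries.append((loc, lastmod))

for loc, lastmod in entries:
    print(loc, lastmod)
```

Using findtext instead of find(...).text also avoids an AttributeError when an entry has no lastmod.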