How to get specific nodes from an XML file with Python

I am looking for a way to get specific tags from a very large XML document using Python's built-in DOM module.

For example:

<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>

<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
  <type>
    cam2
  </type>
  <type>
    cam4
  </type>
</AssetType>

I want to retrieve the values of the child nodes of the AssetType node whose attribute is longname="characters", so that the result is 'pub', 'geo', and 'rig'. Keep in mind that I have more than 1000 <AssetType> nodes.

Solution

There are several ways to solve this problem. Here are some of them:

  1. Use the lxml library:
from lxml import etree

# fname is the path to the XML file
data = etree.parse(fname)
result = [node.text.strip()
    for node in data.xpath("//AssetType[@longname='characters']/type")]
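  2. Use the ElementTree library:
from xml.etree.ElementTree import ElementTree

tree = ElementTree()
root = tree.parse("assets.xml")
# Use a relative path (".//"): findall() on an element rejects absolute paths
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    # Iterate the element directly; getchildren() was removed in Python 3.9
    for type_node in asset_type:
        print(type_node.text.strip())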
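
Because the question stresses that the document is very large, the ElementTree approach can be sketched in two forms: one that parses everything into memory and one that streams with iterparse. This is a minimal sketch, not a definitive implementation. The `<Assets>` wrapper element, the compacted sample data, and the function names `types_in_memory` and `types_streaming` are assumptions made for illustration; they are not part of the original question.

```python
import io
import xml.etree.ElementTree as ET

# Sample data modeled on the question, wrapped in a hypothetical <Assets>
# root so that the document is well-formed.
SAMPLE = """<Assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type><type> geo </type><type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type><type> cam2 </type><type> cam4 </type>
  </AssetType>
</Assets>"""


def types_in_memory(xml_text, longname):
    """Parse the whole document, then select the matching <type> children."""
    root = ET.fromstring(xml_text)
    path = ".//AssetType[@longname='%s']/type" % longname
    return [(node.text or "").strip() for node in root.findall(path)]


def types_streaming(source, longname):
    """Read the document incrementally, emptying each AssetType after use."""
    found = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "AssetType":
            if elem.get("longname") == longname:
                found.extend((t.text or "").strip() for t in elem.findall("type"))
            elem.clear()  # drop the children of this AssetType
    return found


print(types_in_memory(SAMPLE, "characters"))
print(types_streaming(io.BytesIO(SAMPLE.encode("utf-8")), "characters"))
```

The streaming variant accepts a file name or a binary file object, so it could be pointed at the real file instead of the in-memory sample.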
  3. Use the pulldom library:
from xml.dom import pulldom

def getInnerText(oNode):
    rc = ""
    nodelist = oNode.childNodes
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
        elif node.nodeType == node.ELEMENT_NODE:
            rc = rc + getInnerText(node)   # recurse into child elements
        elif node.nodeType == node.CDATA_SECTION_NODE:
            rc = rc + node.data
        else:
            # node.nodeType: PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, NOTATION_NODE and so on
            pass
    return rc

# xml_file is the path to the XML file
stream = pulldom.parse(xml_file)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node) # node now contains a mini-dom tree
            type_nodes = node.getElementsByTagName('type')
            for type_node in type_nodes:
                # type_text holds the text inside the <type> element
                type_text = getInnerText(type_node)
                print(type_text.strip())
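
The pulldom approach can also be wrapped in a generator, so the caller receives values as they are found and may stop reading early. The following is a minimal sketch under a few assumptions: the sample is wrapped in a hypothetical `<Assets>` root, each `<type>` is assumed to contain only text, and the name `iter_types` is made up for this example.

```python
from xml.dom import pulldom

# Sample data modeled on the question, wrapped in a hypothetical <Assets>
# root so that the document is well-formed.
SAMPLE = """<Assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type><type> geo </type><type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type><type> cam2 </type><type> cam4 </type>
  </AssetType>
</Assets>"""


def iter_types(xml_text, longname):
    """Yield the stripped text of each <type> under the matching AssetType."""
    stream = pulldom.parseString(xml_text)
    for event, node in stream:
        if event != pulldom.START_ELEMENT or node.tagName != "AssetType":
            continue
        if node.getAttribute("longname") != longname:
            continue
        stream.expandNode(node)  # build a small DOM for this element only
        for type_node in node.getElementsByTagName("type"):
            # Assumes <type> holds only text (no nested elements).
            text = "".join(
                child.data
                for child in type_node.childNodes
                if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
            )
            yield text.strip()


print(list(iter_types(SAMPLE, "characters")))
```

For a file on disk, `pulldom.parse(path)` could replace `pulldom.parseString(xml_text)` inside the function.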
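  4. Use the xml.sax library:
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class AssetTypeHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.in_asset_type = False   # inside the "characters" AssetType
        self.text = None             # text chunks of the current <type>
        self.types = []

    def startElement(self, name, attrs):
        if name == "AssetType":
            # Only track the AssetType whose longname is "characters"
            self.in_asset_type = attrs.get("longname") == "characters"
        elif self.in_asset_type and name == "type":
            self.text = []

    def characters(self, content):
        # The value is element text, not an attribute; it may arrive in chunks
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "type" and self.text is not None:
            self.types.append("".join(self.text).strip())
            self.text = None
        elif name == "AssetType":
            self.in_asset_type = False

parser = make_parser()
handler = AssetTypeHandler()
parser.setContentHandler(handler)
parser.parse("assets.xml")
print(handler.types)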
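
With SAX, the value of each `<type>` arrives as character data rather than as an attribute, so a handler has to buffer the text between the start and end events. The sketch below tries this on in-memory sample data. The `<Assets>` wrapper and the class name `TypeCollector` are assumptions for illustration only.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Sample data modeled on the question, wrapped in a hypothetical <Assets>
# root so that the document is well-formed.
SAMPLE = """<Assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type><type> geo </type><type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type><type> cam2 </type><type> cam4 </type>
  </AssetType>
</Assets>"""


class TypeCollector(ContentHandler):
    """Collect the text of <type> children of one AssetType (hypothetical helper)."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.in_match = False  # inside the AssetType we are looking for
        self.buffer = None     # text chunks of the current <type>, if any
        self.types = []

    def startElement(self, name, attrs):
        if name == "AssetType":
            self.in_match = attrs.get("longname") == self.longname
        elif name == "type" and self.in_match:
            self.buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type" and self.buffer is not None:
            self.types.append("".join(self.buffer).strip())
            self.buffer = None
        elif name == "AssetType":
            self.in_match = False


handler = TypeCollector("characters")
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.types)
```

Because the handler never builds a tree, memory use stays roughly constant regardless of how many AssetType nodes the file contains.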
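  5. Use lxml with the XPath text() function:
from lxml import etree

data = etree.parse(fname)
# text() returns the raw strings, so strip the surrounding whitespace
result = [text.strip()
    for text in data.xpath("//AssetType[@longname='characters']/type/text()")]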
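  6. Use a higher-performance SAX parser library:
# Note: `sax` is not a standard-library module; the package is not named here
from sax import make_parser

# This works much like xml.sax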