How to Extract Values from an XML Tree (Using Python)

An API query returned an XML tree, and we need to extract specific values from it, such as LinksInCount.

  1. Solution

    Python's lxml library can parse the XML tree and extract the required value. The steps are as follows:

    1. Install the lxml library:
    pip install lxml
    
    2. Import lxml's etree module (aliased as ET here; the standard-library xml.etree.ElementTree is not needed for these steps):
    import lxml.etree as ET
    
    3. Parse the XML string into an element tree:
    xml_string = """
    <aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
    <aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
    <aws:OperationRequest>
    <aws:RequestId>5486794a-0d03-4d47-a45b-e95764c3f0ee</aws:RequestId>
    </aws:OperationRequest>
    <aws:UrlInfoResult>
    <aws:Alexa>
    
      <aws:ContentData>
        <aws:DataUrl type="canonical">yahoo.com/</aws:DataUrl>
        <aws:Asin>B00006D2TC</aws:Asin>
        <aws:SiteData>
          <aws:Title>Yahoo!</aws:Title>
          <aws:Description>Personalized content and search options. Chatrooms, free e-mail, clubs, and pager.</aws:Description>
          <aws:OnlineSince>18-Jan-1995</aws:OnlineSince>
        </aws:SiteData>
        <aws:Speed>
          <aws:MedianLoadTime>2242</aws:MedianLoadTime>
          <aws:Percentile>51</aws:Percentile>
        </aws:Speed>
        <aws:AdultContent>no</aws:AdultContent>
        <aws:Language>
          <aws:Locale>en</aws:Locale>
        </aws:Language>
        <aws:LinksInCount>76894</aws:LinksInCount>
        <aws:OwnedDomains>
          <aws:OwnedDomain>
            <aws:Domain>yahooligans.com</aws:Domain>
            <aws:Title>yahooligans.com</aws:Title>
          </aws:OwnedDomain>
        </aws:OwnedDomains>
      </aws:ContentData>
    
      <aws:Related>
        <aws:DataUrl type="canonical">yahoo.com/</aws:DataUrl>
        <aws:Asin>B00006D2TC</aws:Asin>
        <aws:RelatedLinks>
          <aws:RelatedLink>
            <aws:DataUrl type="canonical">aol.com/</aws:DataUrl>
            <aws:NavigableUrl>http://aol.com/</aws:NavigableUrl>
            <aws:Asin>B00006ARD3</aws:Asin>
            <aws:Relevance>301</aws:Relevance>
          </aws:RelatedLink>
        </aws:RelatedLinks>
        <aws:Categories>
          <aws:CategoryData>
            <aws:Title>On the Web/Web Portals</aws:Title>
            <aws:AbsolutePath>Top/Computers/Internet/On_the_Web/Web_Portals</aws:AbsolutePath>
          </aws:CategoryData>
        </aws:Categories>
      </aws:Related>        
    
      <aws:TrafficData>
        <aws:DataUrl type="canonical">yahoo.com/</aws:DataUrl>
        <aws:Asin>B00006D2TC</aws:Asin>
        <aws:Rank>1</aws:Rank>
        <aws:UsageStatistics>
    
          <aws:UsageStatistic>
            <aws:TimeRange>
              <aws:Days>1</aws:Days>
            </aws:TimeRange>
            <aws:Rank>
              <aws:Value>1</aws:Value>
              <aws:Delta>+0</aws:Delta>
            </aws:Rank>
            <aws:Reach>
              <aws:Rank>
                <aws:Value>2</aws:Value>
                <aws:Delta>+0</aws:Delta>
              </aws:Rank>
              <aws:PerMillion>
                <aws:Value>252,500</aws:Value>
                <aws:Delta>-1%</aws:Delta>
              </aws:PerMillion>
            </aws:Reach>
            <aws:PageViews>
              <aws:PerMillion>
                <aws:Value>51,400</aws:Value>
                <aws:Delta>-1%</aws:Delta>
              </aws:PerMillion>
              <aws:Rank>
                <aws:Value>1</aws:Value>
                <aws:Delta>+0</aws:Delta>
              </aws:Rank>
              <aws:PerUser>
                <aws:Value>13.7</aws:Value>
                <aws:Delta>-1%</aws:Delta>
              </aws:PerUser>
            </aws:PageViews>
          </aws:UsageStatistic>
    
        </aws:UsageStatistics>
      </aws:TrafficData>
    
    </aws:Alexa>
    </aws:UrlInfoResult>
    <aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
    <aws:StatusCode>Success</aws:StatusCode>
    </aws:ResponseStatus>
    </aws:Response>
    </aws:UrlInfoResponse>
    """
    
    tree = ET.fromstring(xml_string)
    
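    4. Use an XPath-style path with find() to locate the required element. The path must be relative (start with ".//"), because find() on an element rejects an absolute "//" path. The namespace must be the one redeclared on aws:Response, since that declaration overrides the root's aws prefix for everything inside it:
    links_in_count = tree.find(".//{http://awis.amazonaws.com/doc/2005-07-11}LinksInCount")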
    
    5. Get the element's text:
    print(links_in_count.text)
    

    Output:

    76894
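
The lookup can also be written with a prefix-to-URI mapping instead of the `{uri}tag` form, which tends to be easier to read when several values are extracted. The following is a minimal sketch, not the only way to do it: the `SAMPLE` document is a trimmed-down stand-in for the response above, and the helper name `get_links_in_count` and the prefix `awis` are made up for illustration (only the namespace URI has to match the document). It falls back to the standard-library `xml.etree.ElementTree`, whose `find()` accepts the same arguments, if lxml is not installed.

```python
try:
    from lxml import etree as ET  # the library used in the steps above
except ImportError:
    import xml.etree.ElementTree as ET  # standard-library fallback, same find() API

# Trimmed-down stand-in for the API response (assumption: same nesting and namespaces)
SAMPLE = """
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
  <aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
    <aws:UrlInfoResult>
      <aws:Alexa>
        <aws:ContentData>
          <aws:LinksInCount>76894</aws:LinksInCount>
        </aws:ContentData>
      </aws:Alexa>
    </aws:UrlInfoResult>
  </aws:Response>
</aws:UrlInfoResponse>
"""

# The prefix "awis" is arbitrary; only the URI must match the document.
NS = {"awis": "http://awis.amazonaws.com/doc/2005-07-11"}


def get_links_in_count(xml_text):
    """Return LinksInCount as an int, or None if the element is absent or empty."""
    root = ET.fromstring(xml_text.strip())
    node = root.find(".//awis:LinksInCount", namespaces=NS)
    if node is None or node.text is None:
        return None  # avoids AttributeError on a failed lookup
    return int(node.text)


count = get_links_in_count(SAMPLE)
print(count)
```

Returning None on a missing element is a design choice for this sketch; `find()` itself returns None when nothing matches, so calling `.text` on the result without a check raises AttributeError.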