Processing XML with ElementTree

Author: A.M. Kuchling <amk@amk.ca>
Version: 1820
Date: 2006-02-07

Anatomy of an XML document

<document>
  <?xml-stylesheet type="text/css"
        href="basic.css"?>
  <!-- Generated with ElementTree -->

  <author name='amk' href="http://www.amk.ca" />

  <p class="note">Note.</p>
  <p class="warning">Warning paragraph.</p>
  <p>Regular paragraph.</p>
</document>

Example document tree

<?xml version="1.0"?>
<document>
  <h1>Heading</h1>
  <p>Paragraph.  <em>Word</em></p>
</document>

An example tree produced by ElementTree.

Example: Print warning paragraphs

Prints all paragraphs that have a class="warning" attribute:

from elementtree import ElementTree as et

tree = et.parse('ex-1.xml')

for para in tree.getiterator('p'):
    cl = para.get('class')
    if cl == 'warning':
        print para.text
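
For anyone trying this today, here is a rough self-contained version of the same loop. It assumes Python 3 and the standard library's copy of the package (xml.etree.ElementTree), and it parses an inline string standing in for ex-1.xml, so the document contents are illustrative.

```python
import xml.etree.ElementTree as et

# Inline stand-in for ex-1.xml (an assumption; any document with <p> works).
doc = """<document>
  <p class="note">Note.</p>
  <p class="warning">Warning paragraph.</p>
  <p>Regular paragraph.</p>
</document>"""

root = et.XML(doc)

# iter() is the current spelling of getiterator().
warnings = [para.text for para in root.iter('p')
            if para.get('class') == 'warning']

for text in warnings:
    print(text)
```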

Document input: parse()

et.parse(source): returns a tree

tree = et.parse('ex-1.xml')

tree = et.parse(open('ex-1.xml', 'r'))

import urllib

feed = urllib.urlopen(
          'http://planet.python.org/rss10.xml')
tree = et.parse(feed)

Document input: XML()

et.XML("<?xml ..."): returns root element

svg = et.XML("""<svg width="10px" version="1.0">
             </svg>""")
svg.set('height', '320px')
svg.append(elem1)
...

Document input: XMLID

et.XMLID(str) : (Element, dict)

xml_doc = """<document>
  <h1 id="chapter1">...</h1>
  <p id="note1" class="note">...</p>
  <p id="warn1" class="warning">...</p>
  <p>Regular paragraph.</p>
</document>"""
root, id_d = et.XMLID(xml_doc)

print id_d
{'note1': <Element p at 3df3a0>,
 'warn1': <Element p at 3df468>,
 'chapter1': <Element h1 at 3df3f0>}

For XMLID():

  • The dictionary maps element IDs to elements.
  • It looks for attributes named 'id'.
  • xml:id is not yet supported.
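
A small sketch of using the ID dictionary for lookups, assuming Python 3 and the standard library's xml.etree.ElementTree. The element contents here are made up for illustration.

```python
import xml.etree.ElementTree as et

xml_doc = """<document>
  <h1 id="chapter1">Heading</h1>
  <p id="warn1" class="warning">Careful.</p>
  <p>Regular paragraph.</p>
</document>"""

root, id_d = et.XMLID(xml_doc)

# Jump straight to an element by its 'id' attribute.
warn = id_d['warn1']
print(warn.tag, warn.get('class'), warn.text)

# Elements without an 'id' attribute are simply left out of the dictionary.
print(sorted(id_d))
```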

Creating a new tree

et.ElementTree([root], [file]) -- Creates a new ElementTree

root = et.XML('<svg/>')
new_tree = et.ElementTree(root)

tree = et.ElementTree(file='ex-1.xml')

Document Output

tree.write(file, encoding) -- outputs XML to file

# Encoding is US-ASCII
tree.write('output.xml')

f = open('output.xml', 'w')
tree.write(f, 'utf-8')
f.close()

file can be a filename string or a file-like object that has a write() method.

The default encoding is us-ascii, which isn't very useful. You'll usually want to specify UTF-8.

Namespace declarations are generated on output. Prefixes aren't preserved, so instead of 'dc:creator' you'll get something like 'ns0:creator'.
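
To see the prefix rewriting in action, here is a minimal sketch assuming Python 3 and the standard library's xml.etree.ElementTree. The 'ex' prefix and the example.org namespace URI are invented for the demonstration; output goes to an in-memory buffer instead of a file.

```python
import io
import xml.etree.ElementTree as et

# 'ex' and its namespace URI are hypothetical, chosen for the demo.
root = et.XML('<doc xmlns:ex="http://example.org/ns">'
              '<ex:item>caf\u00e9</ex:item></doc>')
tree = et.ElementTree(root)

# Any object with a write() method will do; BytesIO collects the bytes.
buf = io.BytesIO()
tree.write(buf, encoding='utf-8')
output = buf.getvalue()

# The original 'ex' prefix is gone; a generated prefix takes its place.
print(repr(output))
```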

Traversing a tree: getroot()

tree.getroot() : returns root element of a tree.

root = tree.getroot()
for c in root.getchildren(): ...

Traversing a tree: getiterator()

tree|elem.getiterator([tag]) -> iterator over elements

# Print all elements
for elem in tree.getiterator():
    ...
for elem in tree.getiterator('*'):
    ...

# Print all paragraph elements
for elem in tree.getiterator('p'):
    ...

Traversing a tree: getiterator()

Traversal is pre-order:

document, block, p, p, block, p, block, p
Example tree with nested elements.
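
One tree shape that yields exactly this order, as a sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree; the nesting is my guess at a matching tree, not necessarily the one in the diagram.

```python
import xml.etree.ElementTree as et

# A hypothetical document whose pre-order walk matches the list above.
root = et.XML("""<document>
  <block><p/><p/></block>
  <block><p/><block><p/></block></block>
</document>""")

# iter() (formerly getiterator()) visits each element before its children.
order = [elem.tag for elem in root.iter()]
print(', '.join(order))
```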

Elements: The element name

elem.tag : the element's name

Namespaces are treated as "{namespace-uri}tag":

<h:html xmlns:xdc="http://www.xml.com/books"
   xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:body>
    <xdc:bookreview> ...

h:html          {http://www.w3.org/HTML/1998/html4}html
h:body          {http://www.w3.org/HTML/1998/html4}body
xdc:bookreview  {http://www.xml.com/books}bookreview
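
The mapping can be checked directly. This sketch assumes Python 3 and the standard library's xml.etree.ElementTree; the closing tags and the review text are filled in only so the fragment parses.

```python
import xml.etree.ElementTree as et

HTML_NS = 'http://www.w3.org/HTML/1998/html4'
BOOK_NS = 'http://www.xml.com/books'

root = et.XML("""<h:html xmlns:xdc="http://www.xml.com/books"
    xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:body>
    <xdc:bookreview>Review text</xdc:bookreview>
  </h:body>
</h:html>""")

# Prefixes are gone after parsing; each tag carries its full namespace URI.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)

# Lookups therefore have to use the same '{uri}name' form.
review = root.find('{%s}body/{%s}bookreview' % (HTML_NS, BOOK_NS))
print(review.text)
```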

Elements: Children

Children are accessed by indexing and slicing.

elem[n] returns the n'th child

elem[m:n] returns a list of children m through n-1

len(elem) returns the number of children

elem.getchildren() -- returns list of children
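
A quick sketch of list-style access, assuming Python 3 and the standard library's xml.etree.ElementTree. The <ul>/<li> document is invented. Note that list(elem) is used where the slide uses getchildren(), since recent Python versions no longer provide that method.

```python
import xml.etree.ElementTree as et

elem = et.XML('<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>')

first = elem[0]        # first child (indexes start at 0)
middle = elem[1:3]     # list holding the second and third children
count = len(elem)      # number of direct children
children = list(elem)  # all children as a plain list

print(first.text, [e.text for e in middle], count)
```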

Elements: Modifying children

Adding children:

elem[m:n] = [e1, e2] -- replace children m through n-1 with e1 and e2

elem.append(elem2) -- append as last child

elem.insert(index, elem2) -- insert at given index

Removing children:

del elem[n] -- delete n'th child

elem.remove(elem2) -- remove elem2 if it's a child
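
The operations above, run one after another on a toy list. This is a sketch assuming Python 3 and the standard library's xml.etree.ElementTree; li() and texts() are hypothetical helpers written for this example, not part of ElementTree.

```python
import xml.etree.ElementTree as et

def li(text):
    # Hypothetical helper: build an <li> element holding the given text.
    item = et.Element('li')
    item.text = text
    return item

def texts(parent):
    # Hypothetical helper: the text of each child, in order.
    return [child.text for child in parent]

elem = et.XML('<ul><li>a</li><li>b</li></ul>')

c = li('c')
elem.append(c)                  # a b c
elem.insert(0, li('z'))         # z a b c
del elem[1]                     # z b c
elem.remove(c)                  # z b
elem[0:1] = [li('x'), li('y')]  # x y b

after = texts(elem)
print(after)
```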

Creating elements

elem.makeelement(tag, attr_dict)

et.Element(tag, attr_dict, **extra)

et.SubElement(parent, tag, attr_dict, **extra)

feed = root.makeelement('feed',
                        {'version':'0.3'})

svg = et.Element('svg', {'version':'1.0'},
                 width='100px', height='50px')

defs = et.SubElement(svg, 'defs', {})
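
Putting Element and SubElement together, as a sketch assuming Python 3 and the standard library's xml.etree.ElementTree. The 'rect' child and its attributes are made up to show the dictionary and keyword forms side by side.

```python
import xml.etree.ElementTree as et

svg = et.Element('svg', {'version': '1.0'},
                 width='100px', height='50px')

# SubElement creates the child and appends it to the parent in one step.
defs = et.SubElement(svg, 'defs')
rect = et.SubElement(svg, 'rect', {'x': '0', 'y': '0'},
                     width='10', height='10')

# Dictionary entries and keyword arguments both end up as attributes.
print(et.tostring(svg, encoding='unicode'))
```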

Example: Generating HTML from Atom

Atom 0.3 input looks like:

<feed version="0.3" xmlns="http://purl.org/atom/ns#"> ...
    <entry> ...
            <content type="text/html" mode="escaped">
&lt;p&gt;&lt;a href="http://example.org"&gt;This
picture&lt;/a&gt; ... &lt;/p&gt; </content>
    </entry>
</feed>

We want HTML output like this:

<div>
   <p><a href="http://example.org">This picture</a>...
   </p>
   <hr />
   <p><a href="http://example.org/2">Photo 2</a>...
   </p>
   <hr />
</div>

Example: Rearranging a tree (1)

ATOM_NS = 'http://purl.org/atom/ns#'

tree = et.parse('atom-0.3.xml')

div = et.Element('div')
html = et.ElementTree(div)

for entry in tree.getiterator('{%s}entry'
                              % ATOM_NS):
  for content in entry.getiterator('{%s}content'
                                   % ATOM_NS):
      # Check for right content element here

Example: Rearranging a tree (2)

for content in entry.getiterator('{%s}content'
                                 % ATOM_NS):
    typ = content.get('type')
    mode = content.get('mode')

    if typ == 'text/html' and mode == 'escaped':
        subtree = et.XML('<root>' +
                         content.text.strip()
                         + '</root>')
        for c in subtree.getchildren():
            div.append(c)
        div.append(et.Element('hr'))

# Needs 'import sys' at the top of the script
html.write(sys.stdout)
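
Both halves of the example, joined into one runnable sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree, and it uses a small inline feed in place of atom-0.3.xml; the two entries are invented to match the output shown earlier.

```python
import sys
import xml.etree.ElementTree as et

ATOM_NS = 'http://purl.org/atom/ns#'

# Inline stand-in for atom-0.3.xml (illustrative content).
atom_doc = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <entry>
    <content type="text/html" mode="escaped">
      &lt;p&gt;&lt;a href="http://example.org"&gt;This picture&lt;/a&gt;&lt;/p&gt;
    </content>
  </entry>
  <entry>
    <content type="text/html" mode="escaped">
      &lt;p&gt;&lt;a href="http://example.org/2"&gt;Photo 2&lt;/a&gt;&lt;/p&gt;
    </content>
  </entry>
</feed>"""

feed = et.XML(atom_doc)
div = et.Element('div')
html = et.ElementTree(div)

for content in feed.iter('{%s}content' % ATOM_NS):
    if (content.get('type') == 'text/html'
            and content.get('mode') == 'escaped'):
        # The parser has already turned &lt;p&gt; back into <p> inside
        # content.text, so wrap it in a dummy root and parse it again.
        subtree = et.XML('<root>' + content.text.strip() + '</root>')
        for c in list(subtree):
            div.append(c)
        div.append(et.Element('hr'))

# encoding='unicode' writes text, which is what sys.stdout expects.
html.write(sys.stdout, encoding='unicode')
```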

Elements: Attribute handling

elem.attrib : dictionary mapping names to values

elem.get(name, default=None) : get attribute value

elem.set(name, value): set a new value

elem.keys(): list of attribute names

elem.items(): list of (name, value) tuples

del elem.attrib[name]: delete an attribute

You can also access the .attrib dictionary directly.

Example: Attributes

Convert Atom 0.3 'content' elements to 1.0 form:

ATOM_CONTENT = '{%s}content' % ATOM_NS
for content in tree.getiterator(ATOM_CONTENT):
    typ = content.get('type')
    mode = content.get('mode')

    if typ == 'text/html' and mode == 'escaped':
        content.set('type', 'html')
        del content.attrib['mode']

Elements: Accessing text

Elements have two attributes for text:

.text -- content between the element's start tag and its first child (or its end tag, if it has no children)

.tail -- content between the element's end tag and the next sibling (or the parent's end tag)

<document><elem1>e1 content</elem1>  
Inter-element text
<elem2>e2 content</elem2>  
</document>

Tree diagram showing text
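
A sketch that prints where each piece of text ends up, assuming Python 3 and the standard library's xml.etree.ElementTree. The document is the one above with the whitespace removed, so the values are easier to read.

```python
import xml.etree.ElementTree as et

root = et.XML('<document><elem1>e1 content</elem1>'
              'Inter-element text'
              '<elem2>e2 content</elem2></document>')
elem1, elem2 = root

print(repr(root.text))   # nothing before the first child
print(repr(elem1.text))  # text inside elem1
print(repr(elem1.tail))  # text after </elem1>, stored on elem1
print(repr(elem2.tail))  # nothing after </elem2>
```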

Comments and PIs

Checking if an element is a comment or PI:

if elem.tag is et.Comment:
    ...
elif elem.tag is et.ProcessingInstruction:
    ...

et.Comment(text) -- create a comment

et.ProcessingInstruction(target, text=None) -- create a PI
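
A sketch that builds a comment and a PI by hand and then classifies the children with the tag checks above. It assumes Python 3 and the standard library's xml.etree.ElementTree; the nodes are created in code here, not read from a file.

```python
import xml.etree.ElementTree as et

root = et.Element('document')
root.append(et.ProcessingInstruction(
    'xml-stylesheet', 'type="text/css" href="basic.css"'))
root.append(et.Comment(' Generated with ElementTree '))
et.SubElement(root, 'p').text = 'Regular paragraph.'

kinds = []
for elem in root:
    if elem.tag is et.Comment:
        kinds.append('comment')
    elif elem.tag is et.ProcessingInstruction:
        kinds.append('pi')
    else:
        kinds.append(elem.tag)

print(kinds)
print(et.tostring(root, encoding='unicode'))
```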

Advanced topics

  • ElementPath
  • Parsing HTML
  • Event-based parsing

ElementPath

ElementTree supports a small query language:

Simpler version of entry/content loop:

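# Tags in the query need the '{namespace-uri}' prefix
# if the document uses a namespace, as Atom does.
for content in tree.findall('entry/content'):
    ...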

Methods:

findall(query): list of matching nodes

find(query): first matching element, or None

findtext(query, default=None): .text attribute of first matching element
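
The three methods side by side, as a sketch assuming Python 3 and the standard library's xml.etree.ElementTree. The feed is a made-up document without namespaces, to keep the queries short.

```python
import xml.etree.ElementTree as et

root = et.XML("""<feed>
  <title>Example feed</title>
  <entry><content>First</content></entry>
  <entry><content>Second</content></entry>
</feed>""")

contents = root.findall('entry/content')          # every match, as a list
first_entry = root.find('entry')                  # first match only
missing = root.find('author')                     # no match gives None
title = root.findtext('title')                    # .text of first match
fallback = root.findtext('subtitle', 'untitled')  # default when no match

print([c.text for c in contents], title, fallback)
```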

ElementPath syntax

Query = components separated by '/'

Component        Meaning
.                Current element node
*                Matches any child element
<empty string>   Matches any descendant (written as '//')
<name>           Matches elements with that name

Query                                             Result
p                                                 All p children of the current element
.//p                                              All p elements
chapter/p                                         All p that are children of a chapter
chapter/*/p                                       All p that are grandchildren of a chapter
chapter//p                                        All p that are descendants of a chapter
quotation/{http://purl.org/dc/elements/}creator   All dc:creator children of quotation elements

This syntax is inspired by XPath, but it's a tiny, tiny subset of XPath. Missing features include:

  • Absolute queries not allowed
  • Can only select element nodes; there's no text() to select child text nodes
  • No ability to select a numbered child (chapter[5] to select the fifth chapter element)
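
Several of the queries from the table, run against a small tree. This is a sketch assuming Python 3 and the standard library's xml.etree.ElementTree; the <book> document and the texts() helper are hypothetical.

```python
import xml.etree.ElementTree as et

root = et.XML("""<book>
  <p>intro</p>
  <chapter>
    <p>c1</p>
    <section><p>s1</p></section>
  </chapter>
</book>""")

def texts(query):
    # Hypothetical helper: text of every element matching the query.
    return [p.text for p in root.findall(query)]

results = {
    'p': texts('p'),                      # direct children only
    './/p': texts('.//p'),                # anywhere below the root
    'chapter/p': texts('chapter/p'),      # children of a chapter
    'chapter/*/p': texts('chapter/*/p'),  # grandchildren of a chapter
    'chapter//p': texts('chapter//p'),    # any depth below a chapter
}

for query, found in results.items():
    print(query, found)
```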

Parsing HTML

HTMLTreeBuilder, TidyHTMLTreeBuilder (requires elementtidy)

import urllib
from elementtree import TidyHTMLTreeBuilder

page = urllib.urlopen('http://www.python.org')
tree = TidyHTMLTreeBuilder.parse(page)

Event-based processing

et.iterparse returns a stream of events.

parser = et.iterparse('largefile.xml',
           ['start', 'end', 'start-ns', 'end-ns'])
for event, elem in parser:
    if event == 'end':
        ...
        # Discard element's contents
        elem.clear()

On my Mac, simply parsing the 1.5 MB file into a tree took up about 7 MB. With iterparse and the clear() call, the peak usage was about 2 MB. (I used the Book of Mormon because it was the largest XML document I could find.)
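
A small version of the same pattern that runs without a large file on disk. It is a sketch assuming Python 3 and the standard library's xml.etree.ElementTree; the <log>/<entry> document is generated in memory for the demonstration, and only 'end' events are requested.

```python
import io
import xml.etree.ElementTree as et

# Generated stand-in for largefile.xml: 1000 small <entry> elements.
data = b'<log>' + b'<entry>x</entry>' * 1000 + b'</log>'
source = io.BytesIO(data)

count = 0
for event, elem in et.iterparse(source, events=('end',)):
    if elem.tag == 'entry':
        count += 1
        # Drop the entry's text, attributes and children once handled.
        elem.clear()

print(count)
```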

Conclusion

ElementTree: http://effbot.org/zone/element-index.htm

Slides: http://www.amk.ca/talks/2006-02-07

Questions?

ElementInclude

Features:

  • xinclude:include
  • Parse types of 'text' or 'xml'
  • Not supported: xpointer, and the 'encoding' and 'accept' attributes.

<root xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="ex-1.xml" parse="xml"/>
</root>

ElementInclude Example

Code:

from elementtree import ElementInclude
ElementInclude.include(tree.getroot())

Result:

<root>
  <document>
    <h1>Heading</h1> ...
  </document>
</root>
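
A sketch of the same include step with a custom loader, assuming Python 3 and the standard library's xml.etree.ElementInclude. DOCS and memory_loader are hypothetical names; the loader serves documents from a dictionary so the example does not need ex-1.xml on disk.

```python
import xml.etree.ElementTree as et
from xml.etree import ElementInclude

# Hypothetical in-memory "files", standing in for ex-1.xml on disk.
DOCS = {
    'ex-1.xml': '<document><h1>Heading</h1></document>',
}

def memory_loader(href, parse, encoding=None):
    # Called for each xi:include with its href and parse attributes.
    if parse == 'xml':
        return et.XML(DOCS[href])  # an element for parse="xml"
    return DOCS[href]              # plain text for parse="text"

root = et.XML("""<root xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="ex-1.xml" parse="xml"/>
</root>""")

# include() modifies the tree in place.
ElementInclude.include(root, loader=memory_loader)
print(et.tostring(root, encoding='unicode'))
```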

The include() function recursively scans through the entire subtree of the element. You can supply your own loader function that receives the 'href' attribute