Author: | A.M. Kuchling <amk@amk.ca> |
---|---|
Version: | 1820 |
Date: | 2006-02-07 |
    <document>
      <?xml-stylesheet type="text/css" href="basic.css"?>
      <!-- Generated with ElementTree -->
      <author name='amk' href="http://www.amk.ca" />
      <p class="note">Note.</p>
      <p class="warning">Warning paragraph.</p>
      <p>Regular paragraph.</p>
    </document>

    <?xml version="1.0"?>
    <document>
      <h1>Heading</h1>
      <p>Paragraph. <em>Word</em></p>
    </document>
Print the text of every paragraph whose 'class' attribute is 'warning':

    from elementtree import ElementTree as et

    tree = et.parse('ex-1.xml')
    for para in tree.getiterator('p'):
        cl = para.get('class')
        if cl == 'warning':
            print para.text
et.parse(source): returns a tree
    import urllib

    tree = et.parse('ex-1.xml')
    tree = et.parse(open('ex-1.xml', 'r'))
    feed = urllib.urlopen('http://planet.python.org/rss10.xml')
    tree = et.parse(feed)
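As a rough sketch of the same idea, the source can be any file-like object, not only a filename. This example assumes Python 3's standard-library `xml.etree.ElementTree` (the descendant of the standalone `elementtree` package used in these slides) and uses an in-memory buffer with illustrative data in place of a real file or URL.

```python
import io
import xml.etree.ElementTree as et

# Stand-in for the contents of a file such as ex-1.xml (illustrative data).
xml_bytes = b"""<?xml version="1.0"?>
<document>
  <h1>Heading</h1>
  <p>Paragraph. <em>Word</em></p>
</document>"""

# Any object with a read() method works as a source.
tree = et.parse(io.BytesIO(xml_bytes))
root = tree.getroot()
heading = root.find('h1').text

print(root.tag)   # document
print(heading)    # Heading
```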
et.XML("<?xml ..."): returns root element
    svg = et.XML("""<svg width="10px" version="1.0">
    </svg>""")
    svg.set('height', '320px')
    svg.append(elem1)
    ...
et.XMLID(str) : (Element, dict)
    xml_doc = """<document>
    <h1 id="chapter1">...</h1>
    <p id="note1" class="note">...</p>
    <p id="warn1" class="warning">...</p>
    <p>Regular paragraph.</p>
    </document>"""
    root, id_d = et.XMLID(xml_doc)
    print id_d

    {'note1': <Element p at 3df3a0>,
     'warn1': <Element p at 3df468>,
     'chapter1': <Element h1 at 3df3f0>}
For XMLID():
- The dictionary maps element IDs to elements.
- It looks for attributes named 'id'.
- xml:id is not yet supported.
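A small sketch of using the ID dictionary to jump straight to an element, assuming Python 3's standard-library `xml.etree.ElementTree`, which still provides `XMLID()`. The document text and the variable name `id_map` are illustrative.

```python
import xml.etree.ElementTree as et

xml_doc = """<document>
  <h1 id="chapter1">Title</h1>
  <p id="note1" class="note">A note.</p>
  <p>Regular paragraph.</p>
</document>"""

# XMLID() returns the root element plus a dict keyed by 'id' attribute value.
root, id_map = et.XMLID(xml_doc)
note = id_map['note1']

print(sorted(id_map))     # ['chapter1', 'note1']
print(note.get('class'))  # note
```

Elements without an 'id' attribute, such as the last paragraph, simply do not appear in the dictionary.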
et.ElementTree([root], [file]) -- Creates a new ElementTree
    root = et.XML('<svg/>')
    new_tree = et.ElementTree(root)
    tree = et.ElementTree(file='ex-1.xml')
tree.write(file, encoding) -- outputs XML to file
    # Encoding is US-ASCII
    tree.write('output.xml')

    f = open('output.xml', 'w')
    tree.write(f, 'utf-8')

file can be a filename string or a file-like object that has a write() method.
The default encoding is us-ascii, which isn't very useful. You'll usually want to specify UTF-8.
Namespace declarations are generated on output. Prefixes aren't preserved, so instead of 'dc:creator' you'll get something like 'ns0:creator'.
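The prefix behaviour can be seen in a short sketch. It assumes Python 3's standard-library `xml.etree.ElementTree`; the namespace URI is hypothetical, chosen so that the library has no well-known prefix for it, and the output goes to an in-memory buffer rather than a file.

```python
import io
import xml.etree.ElementTree as et

# Hypothetical namespace URI with no registered prefix.
NS = 'http://example.org/meta'

root = et.Element('quotation')
creator = et.SubElement(root, '{%s}creator' % NS)
creator.text = 'Caf\u00e9 author'

# Write UTF-8 bytes to a file-like object with a write() method.
buf = io.BytesIO()
et.ElementTree(root).write(buf, encoding='utf-8')
output = buf.getvalue().decode('utf-8')
print(output)
```

The serialized element comes out with a generated prefix (`ns0:creator`) and a matching `xmlns:ns0` declaration, and the non-ASCII character is written as-is because UTF-8 was requested.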
tree.getroot() : returns root element of a tree.
    root = tree.getroot()
    for c in root.getchildren():
        ...
tree|elem.getiterator([tag]) -> iterator over elements
    # Print all elements
    for elem in tree.getiterator():
        ...
    for elem in tree.getiterator('*'):
        ...

    # Print all paragraph elements
    for elem in tree.getiterator('p'):
        ...
Traversal is pre-order:
document, block, p, p, block, p, block, p
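To make the ordering concrete, here is a minimal sketch that builds a tree with that shape and walks it. It assumes Python 3's standard-library `xml.etree.ElementTree`, where `iter()` is the current spelling of `getiterator()`; the nesting of the blocks is an assumption chosen to reproduce the sequence above.

```python
import xml.etree.ElementTree as et

root = et.XML(
    '<document>'
    '<block><p/><p/></block>'
    '<block><p/></block>'
    '<block><p/></block>'
    '</document>'
)

# Pre-order: each element is visited before its children.
order = [elem.tag for elem in root.iter()]
print(', '.join(order))
# document, block, p, p, block, p, block, p
```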
elem.tag : the element's name
Namespaces are treated as "{namespace-uri}tag":
    <h:html xmlns:xdc="http://www.xml.com/books"
            xmlns:h="http://www.w3.org/HTML/1998/html4">
      <h:body>
        <xdc:bookreview> ...

Prefixed name | Value of elem.tag |
---|---|
h:html | {http://www.w3.org/HTML/1998/html4}html |
h:body | {http://www.w3.org/HTML/1998/html4}body |
xdc:bookreview | {http://www.xml.com/books}bookreview |
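The mapping can be checked with a short sketch, assuming Python 3's standard-library `xml.etree.ElementTree`. The constants `HTML_NS` and `BOOK_NS` are names introduced here for readability; the document is a closed-off version of the fragment above.

```python
import xml.etree.ElementTree as et

HTML_NS = 'http://www.w3.org/HTML/1998/html4'
BOOK_NS = 'http://www.xml.com/books'

root = et.XML(
    '<h:html xmlns:xdc="http://www.xml.com/books" '
    'xmlns:h="http://www.w3.org/HTML/1998/html4">'
    '<h:body><xdc:bookreview/></h:body>'
    '</h:html>'
)

# Prefixes are gone after parsing; only the namespace URI remains in the tag.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)

# Lookups must use the same {uri}name form.
review = root.find('{%s}body/{%s}bookreview' % (HTML_NS, BOOK_NS))
```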
Children are accessed by indexing and slicing, as with a list.
elem[n] returns the n'th child (counting from 0)
elem[m:n] returns a list of the children at indexes m through n-1
len(elem) returns the number of children
elem.getchildren() -- returns list of children
Adding children:
elem[m:n] = [e1, e2]
elem.append(elem2) -- append as last child
elem.insert(index, elem2) -- insert at given index
Removing children:
del elem[n] -- delete n'th child
elem.remove(elem2) -- remove elem2 if it's a child
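The list-like operations above can be tried together in one short sketch, assuming Python 3's standard-library `xml.etree.ElementTree`. The element names are illustrative, and `names()` is a small helper defined here, not part of the library.

```python
import xml.etree.ElementTree as et

root = et.XML('<list><a/><b/><c/><d/></list>')

def names(elem):
    """Return the tag names of elem's children, in order."""
    return [child.tag for child in elem]

first = root[0].tag                    # 'a'
middle = [c.tag for c in root[1:3]]    # ['b', 'c'] (end index excluded)
count = len(root)                      # 4

root.append(et.Element('e'))           # a b c d e
root.insert(0, et.Element('start'))    # start a b c d e
del root[1]                            # start b c d e
root.remove(root.find('c'))            # start b d e
root[1:3] = [et.Element('x'), et.Element('y')]   # start x y e

final = names(root)
print(final)  # ['start', 'x', 'y', 'e']
```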
elem.makeelement(tag, attr_dict)
et.Element(tag, attr_dict, **extra)
et.SubElement(parent, tag, attr_dict, **extra)
    feed = root.makeelement('feed', {'version':'0.3'})
    svg = et.Element('svg', {'version':'1.0'},
                     width='100px', height='50px')
    defs = et.SubElement(svg, 'defs', {})
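A brief sketch of how the three constructors differ, assuming Python 3's standard-library `xml.etree.ElementTree`: `Element()` creates a free-standing element, `SubElement()` creates one and attaches it to a parent, and `makeelement()` creates one without attaching it. The `title` element is an illustrative addition.

```python
import xml.etree.ElementTree as et

# Attributes may come from a dict, from keyword arguments, or both.
svg = et.Element('svg', {'version': '1.0'}, width='100px', height='50px')

# SubElement() appends the new element to its parent immediately.
defs = et.SubElement(svg, 'defs')

# makeelement() only creates the element; it must be attached explicitly.
title = svg.makeelement('title', {})
title.text = 'Example'
children_before = len(svg)   # 1: only <defs> so far
svg.append(title)

print([child.tag for child in svg])  # ['defs', 'title']
```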
Atom 0.3 input looks like:
    <feed version="0.3" xmlns="http://purl.org/atom/ns#">
      ...
      <entry>
        ...
        <content type="text/html" mode="escaped">
          <p><a href="http://example.org">This picture</a> ... </p>
        </content>
      </entry>
    </feed>
We want HTML output like this:
    <div>
      <p><a href="http://example.org">This picture</a>... </p>
      <hr />
      <p><a href="http://example.org/2">Photo 2</a>... </p>
      <hr />
    </div>
    import sys

    ATOM_NS = 'http://purl.org/atom/ns#'
    tree = et.parse('atom-0.3.xml')
    div = et.Element('div')
    html = et.ElementTree(div)
    for entry in tree.getiterator('{%s}entry' % ATOM_NS):
        for content in entry.getiterator('{%s}content' % ATOM_NS):
            # Check for right content element here
        for content in entry.getiterator('{%s}content' % ATOM_NS):
            typ = content.get('type')
            mode = content.get('mode')
            if typ == 'text/html' and mode == 'escaped':
                subtree = et.XML('<root>' + content.text.strip() + '</root>')
                for c in subtree.getchildren():
                    div.append(c)
                div.append(et.Element('hr'))

    html.write(sys.stdout)
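For readers who want to run the conversion end to end, here is a self-contained sketch of the same loop. It assumes Python 3's standard-library `xml.etree.ElementTree` (so `iter()` replaces `getiterator()` and `list(elem)` replaces `getchildren()`), and the feed text is illustrative data in which the HTML is stored escaped, as `mode="escaped"` implies.

```python
import xml.etree.ElementTree as et

ATOM_NS = 'http://purl.org/atom/ns#'

# Illustrative feed; the escaped HTML arrives as plain text in <content>.
atom_doc = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <entry>
    <content type="text/html" mode="escaped">
      &lt;p&gt;&lt;a href="http://example.org"&gt;This picture&lt;/a&gt;&lt;/p&gt;
    </content>
  </entry>
  <entry>
    <content type="text/plain">Skipped: not escaped HTML.</content>
  </entry>
</feed>"""

feed = et.XML(atom_doc)
div = et.Element('div')

for entry in feed.iter('{%s}entry' % ATOM_NS):
    for content in entry.iter('{%s}content' % ATOM_NS):
        if (content.get('type') == 'text/html'
                and content.get('mode') == 'escaped'):
            # Wrap the fragment so it has a single root, then re-parse it.
            subtree = et.XML('<root>' + content.text.strip() + '</root>')
            for child in list(subtree):
                div.append(child)
            div.append(et.Element('hr'))

html = et.tostring(div, encoding='unicode')
print(html)
```

Re-parsing with `et.XML()` only works when the embedded HTML happens to be well-formed XML; real-world feeds often are not, in which case an HTML-tolerant parser would be needed.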
elem.attrib : dictionary mapping names to values
elem.get(name, default=None) : get attribute value
elem.set(name, value): set a new value
elem.keys(): list of attribute names
elem.items(): list of (name, value) tuples
del elem.attrib[name]: delete an attribute
You can also access the .attrib dictionary directly.
Convert Atom 0.3 'content' elements to 1.0 form:
    ATOM_CONTENT = '{%s}content' % ATOM_NS
    for content in tree.getiterator(ATOM_CONTENT):
        typ = content.get('type')
        mode = content.get('mode')
        if typ == 'text/html' and mode == 'escaped':
            content.set('type', 'html')
            del content.attrib['mode']
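The attribute methods can be exercised on a single element in a few lines. This is a minimal sketch assuming Python 3's standard-library `xml.etree.ElementTree`; the element is built from an illustrative string rather than read from a feed.

```python
import xml.etree.ElementTree as et

content = et.XML('<content type="text/html" mode="escaped">...</content>')
before = sorted(content.keys())      # ['mode', 'type']

if content.get('type') == 'text/html' and content.get('mode') == 'escaped':
    content.set('type', 'html')      # overwrite an existing attribute
    del content.attrib['mode']       # remove one via the .attrib dict

after = sorted(content.items())
missing = content.get('mode', 'absent')  # default is returned when missing

print(after)    # [('type', 'html')]
print(missing)  # absent
```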
Elements have two attributes for text:
.text -- text between the element's start tag and its first child (or its end tag, if it has no children)
.tail -- text between the element's end tag and the next sibling's start tag (or the parent's end tag)
<document><elem1>e1 content</elem1> Inter-element text <elem2>e2 content</elem2> </document>
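Parsing that document shows where each piece of text ends up. The sketch assumes Python 3's standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as et

root = et.XML('<document><elem1>e1 content</elem1> Inter-element text '
              '<elem2>e2 content</elem2> </document>')
elem1, elem2 = root   # unpack the two children

print(repr(root.text))   # None: nothing precedes <elem1>
print(repr(elem1.text))  # 'e1 content'
print(repr(elem1.tail))  # ' Inter-element text '
print(repr(elem2.text))  # 'e2 content'
print(repr(elem2.tail))  # ' '
```

The inter-element text belongs to `elem1.tail`, not to the parent, which is the main surprise for people used to DOM-style text nodes.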
Checking if an element is a comment or PI:
    if elem.tag is et.Comment:
        ...
    elif elem.tag is et.ProcessingInstruction:
        ...
et.Comment(text) -- create a comment
et.ProcessingInstruction(target, text=None) -- create a PI
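A short sketch that creates both node types and then tells them apart by tag, assuming Python 3's standard-library `xml.etree.ElementTree`. The comment text and the `render` target are made up for illustration.

```python
import xml.etree.ElementTree as et

root = et.Element('document')
root.append(et.Comment('generated for illustration'))
root.append(et.ProcessingInstruction('render', 'mode="draft"'))
et.SubElement(root, 'p').text = 'Regular paragraph.'

# Comments and PIs are elements whose .tag is the factory function itself.
kinds = []
for elem in root:
    if elem.tag is et.Comment:
        kinds.append('comment')
    elif elem.tag is et.ProcessingInstruction:
        kinds.append('pi')
    else:
        kinds.append(elem.tag)

print(kinds)  # ['comment', 'pi', 'p']
```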
- ElementPath
- Parsing HTML
- Event-based parsing
ElementTree supports a small query language:
Simpler version of entry/content loop:
for content in tree.findall('entry/content'): ...
Methods:
findall(query): list of matching nodes
find(query): first matching element, or None
findtext(query, default=None): .text attribute of first matching element
Query = components separated by '/'
Component | Meaning |
---|---|
. | Current element node |
* | Matches any child element |
<empty string> | Matches any descendant (the empty component between the slashes in a//b) |
<name> | Matches elements with that name |
Query | Result |
---|---|
p | All p elements that are children of the current element |
.//p | All p elements anywhere below the current element |
chapter/p | All p that are children of a chapter |
chapter/*/p | All p that are grandchildren of a chapter |
chapter//p | All p that are descendants of a chapter |
quotation/{http://purl.org/dc/elements/}creator | All dc:creator children of quotation elements |
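The queries in the table can be compared side by side on a small tree. This is a sketch assuming Python 3's standard-library `xml.etree.ElementTree`; the `book` document and the `texts()` helper are illustrative, with each p element holding a one-letter label so the matches are easy to read.

```python
import xml.etree.ElementTree as et

book = et.XML(
    '<book>'
    '<chapter><p>a</p><section><p>b</p><note><p>c</p></note></section></chapter>'
    '<p>d</p>'
    '</book>'
)

def texts(query):
    """Return the text of every element matched by query, relative to book."""
    return [p.text for p in book.findall(query)]

queries = ['p', './/p', 'chapter/p', 'chapter/*/p', 'chapter//p']
results = {q: texts(q) for q in queries}
for q in queries:
    print('%-12s %s' % (q, results[q]))

first = book.findtext('chapter/p')                # 'a'
nothing = book.find('missing')                    # None
fallback = book.findtext('missing', default='n/a')
```

Note that a bare `p` only reaches direct children of the starting element, while `.//p` searches the whole subtree.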
This syntax is inspired by XPath, but it's a tiny, tiny subset of XPath. Missing features include:
- Absolute queries not allowed
- Can only select element nodes; there's no text() to select child text nodes
- No ability to select a numbered child (chapter[5] to select the fifth chapter element)
HTMLTreeBuilder, TidyHTMLTreeBuilder (requires elementtidy)
    import urllib
    from elementtree import TidyHTMLTreeBuilder

    page = urllib.urlopen('http://www.python.org')
    tree = TidyHTMLTreeBuilder.parse(page)
et.iterparse() returns a stream of (event, element) pairs.

    parser = et.iterparse('largefile.xml',
                          ['start', 'end', 'start-ns', 'end-ns'])
    for event, elem in parser:
        if event == 'end':
            ...
            # Discard element's contents
            elem.clear()

On my Mac, simply parsing the 1.5 MB file into a tree took up about 7 MB. With iterparse and elem.clear(), the peak usage was about 2 MB. (I used the Book of Mormon because it was the largest XML document I could find.)
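A runnable sketch of the same pattern, assuming Python 3's standard-library `xml.etree.ElementTree`. An in-memory buffer of generated records stands in for a large file, and only 'end' events are requested because that is when an element is complete.

```python
import io
import xml.etree.ElementTree as et

# Stand-in for a large file: 1000 small records (illustrative data).
data = b'<log>' + b''.join(
    b'<record id="%d">x</record>' % i for i in range(1000)) + b'</log>'

count = 0
last_id = None
for event, elem in et.iterparse(io.BytesIO(data), events=('end',)):
    if elem.tag == 'record':
        count += 1
        last_id = elem.get('id')   # read what is needed before clearing
        elem.clear()               # drop text, attributes, and children

print(count)    # 1000
print(last_id)  # 999
```

Anything needed from an element has to be read before `clear()` is called, since the call empties the element in place.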
ElementTree: http://effbot.org/zone/element-index.htm
Slides: http://www.amk.ca/talks/2006-02-07
Questions?
Features:
- xinclude:include
- Parse types of 'text' or 'xml'
- What's not supported: xpointer, the 'encoding' or 'accept' attributes.
    <root xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="ex-1.xml" parse="xml"/>
    </root>
Code:
    from elementtree import ElementInclude

    ElementInclude.include(tree.getroot())
Result:
    <root>
      <document>
        <h1>Heading</h1>
        ...
      </document>
    </root>

The include() function recursively scans through the entire subtree of the element. You can supply your own loader function, which receives the 'href' attribute and the parse type.
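A custom loader might look like the following sketch, which assumes Python 3's standard-library `xml.etree.ElementInclude` and its `loader(href, parse, encoding=None)` calling convention. The `DOCUMENTS` dictionary and `memory_loader` function are hypothetical names that serve documents from memory instead of the filesystem.

```python
import xml.etree.ElementTree as et
from xml.etree import ElementInclude

# In-memory stand-ins for files on disk (illustrative names and data).
DOCUMENTS = {
    'ex-1.xml': '<document><h1>Heading</h1></document>',
}

def memory_loader(href, parse, encoding=None):
    """Hypothetical loader: resolve href from DOCUMENTS, not the filesystem."""
    text = DOCUMENTS[href]
    if parse == 'xml':
        return et.XML(text)   # parse="xml": return an element
    return text               # parse="text": return the raw string

root = et.XML(
    '<root xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="ex-1.xml" parse="xml"/>'
    '</root>'
)

# The xi:include element is replaced in place by the loaded document.
ElementInclude.include(root, loader=memory_loader)
result = et.tostring(root, encoding='unicode')
print(result)
```

The same hook could fetch documents over HTTP or from a database; the only contract is returning an element for `parse="xml"` and a string for `parse="text"`.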