Note: since writing this page, the Python HOWTO has been extended to include DOM, making this page largely redundant.
this page is aimed at programmers who want to manipulate xml using python. i'm thinking of small scripts that you need during automated builds (xml config files for j2ee, perhaps), and i'm assuming that you're already a decent programmer, used to reading code and specs (and, of course, that you're already familiar with python and xml).
i'm also assuming that you want to use dom (document object model). to me this is the "obvious" way of manipulating xml - it's just an OO tree of the document, with the kind of structure you'd expect. however, it's not the only way to process xml documents:
the advantages of dom appear to be:
note that i am no expert - i only decided to use python and xml two days ago. the following is intended to help people in a similar situation get to the same level in under an hour.
this is from my own recent experience, based on the following configuration (win32 and linux):
the following are the best sources of information i've found:
the following are high-profile things that i'd suggest not wasting too much time on (i found them to be red herrings):
j2ee servers require an xml file that lists jsp pages (there's a certain format, specified by a dtd). i wanted a script that took the existing file and checked it against a directory of jsp pages, printing warnings for any files that were in the xml file but not in the directory and adding entries for any files present in the directory, but missing from the file.
since the code i write at work is the property of my employers, this isn't it. instead, i've re-written the outline, assuming a simpler xml format and only adding entries.
the xml format i'm assuming is:
<?xml...?>
<!DOCTYPE...>
<filelist>
  <file>
    <name>file_1</name>
  </file>
  ...
  <file>
    <name>file_n</name>
  </file>
</filelist>
and the code is:
from xml.dom.ext import Print
from xml.dom.ext.reader.Sax import FromXmlFile
import os
import sys

def addFile(doc, file, filelist):
    # build <file><name>file</name></file> and append it to <filelist>
    name = doc.createElement("name")
    name.appendChild(doc.createTextNode(file))
    entry = doc.createElement("file")
    entry.appendChild(name)
    filelist.appendChild(entry)

def contains(doc, file):
    # true if some <name> element already holds this file name
    for elt in doc.getElementsByTagName("name"):
        if file == elt.childNodes[0].nodeValue: return True
    return False

def main():
    # argv[1] is the xml file, argv[2] the directory to check it against
    doc = FromXmlFile(sys.argv[1])
    filelist = doc.getElementsByTagName("filelist")[0]
    for file in os.listdir(sys.argv[2]):
        if not contains(doc, file):
            addFile(doc, file, filelist)
    Print(doc)

if __name__ == "__main__": main()
which works as follows:
andrew@tonto:~/src/python$ ls test
newfile1 newfile2 oldfile1
andrew@tonto:~/src/python$ cat test.xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE filelist>
<filelist>
<file>
<name>oldfile1</name>
</file>
</filelist>
andrew@tonto:~/src/python$ python2.2 demo.py test.xml test
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE filelist><filelist>
<file>
<name>oldfile1</name>
</file>
<file><name>newfile1</name></file><file><name>newfile2</name></file></filelist>
the example above works as expected, but the output is ugly. is there a pretty printer? one way of finding out is to grep the source:
andrew@tonto:~/src/python$ egrep -i pretty `find /usr/lib/python2.2 -name "*.py"`
/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py: def toprettyxml(self, indent="\t", newl="\n"):
/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/Printer.py: # PrettyPrint
/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/__init__.py:def PrettyPrint(root, stream=sys.stdout, encoding='UTF-8', indent=' ',
[...]
and after looking in /usr/lib/python2.2/site-packages/_xmlplus/dom/ext/__init__.py we finally have:
from xml.dom.ext import PrettyPrint
from xml.dom.ext.reader.Sax import FromXmlFile
import os
import sys

def addFile(doc, file, filelist):
    # build <file><name>file</name></file> and append it to <filelist>
    name = doc.createElement("name")
    name.appendChild(doc.createTextNode(file))
    entry = doc.createElement("file")
    entry.appendChild(name)
    filelist.appendChild(entry)

def contains(doc, file):
    # true if some <name> element already holds this file name
    for elt in doc.getElementsByTagName("name"):
        if file == elt.childNodes[0].nodeValue: return True
    return False

def main():
    # argv[1] is the xml file, argv[2] the directory to check it against
    doc = FromXmlFile(sys.argv[1])
    filelist = doc.getElementsByTagName("filelist")[0]
    for file in os.listdir(sys.argv[2]):
        if not contains(doc, file):
            addFile(doc, file, filelist)
    # same as before, but re-indent the whole tree on output
    PrettyPrint(doc)

if __name__ == "__main__": main()
which gives:
andrew@tonto:~/src/python$ python2.2 demo.py test.xml test
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE filelist>
<filelist>
  <file>
    <name>oldfile1</name>
  </file>
  <file>
    <name>newfile1</name>
  </file>
  <file>
    <name>newfile2</name>
  </file>
</filelist>
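the grep output above also turned up toprettyxml in minidom.py, so here is a rough sketch of the same idea using only xml.dom.minidom from the standard library (python 3, not the PyXML calls used above). it also covers the "listed but not in the directory" warning from the original task. the function names (listed_names, stale_names, add_missing, strip_whitespace_nodes, pretty) are my own inventions for illustration, and stripping whitespace-only text nodes before printing is an assumption about what you want: it throws away the original layout so that toprettyxml can re-indent everything.

```python
import os
import tempfile
from xml.dom import minidom


def listed_names(doc):
    """Text of every <name> element whose first child is a text node."""
    return [elt.firstChild.data
            for elt in doc.getElementsByTagName("name")
            if elt.firstChild is not None
            and elt.firstChild.nodeType == elt.TEXT_NODE]


def stale_names(doc, directory):
    """Names listed in the xml that have no matching file in directory."""
    present = set(os.listdir(directory))
    return [name for name in listed_names(doc) if name not in present]


def add_missing(doc, directory):
    """Append a <file><name>..</name></file> entry for each unlisted file."""
    filelist = doc.getElementsByTagName("filelist")[0]
    known = set(listed_names(doc))
    for filename in sorted(os.listdir(directory)):
        if filename not in known:
            name = doc.createElement("name")
            name.appendChild(doc.createTextNode(filename))
            entry = doc.createElement("file")
            entry.appendChild(name)
            filelist.appendChild(entry)


def strip_whitespace_nodes(node):
    """Remove whitespace-only text nodes, recursively, below node."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


def pretty(doc):
    """Return the document re-indented with two spaces per level."""
    strip_whitespace_nodes(doc)
    return doc.toprettyxml(indent="  ")


if __name__ == "__main__":
    # throwaway directory so the demo does not depend on argv
    with tempfile.TemporaryDirectory() as tmp:
        for fname in ("newfile1", "newfile2"):
            open(os.path.join(tmp, fname), "w").close()
        doc = minidom.parseString(
            "<filelist>\n<file>\n<name>oldfile1</name>\n</file>\n</filelist>")
        for name in stale_names(doc, tmp):
            print("warning: %s is listed but not in the directory" % name)
        add_missing(doc, tmp)
        print(pretty(doc))
```

in a real script you would parse sys.argv[1] with minidom.parse and pass sys.argv[2] as the directory, as in the PyXML version.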
one of the interesting things described in the w3c docs on dom is the tree walker class. this is an iterator that can be moved around the tree, rather than the kind of tree-walking code that applies a set of classes to each node in the tree (as in the sablecc compiler, for example). it's very flexible (although i have always taken care not to delete the node that the walker is currently "at", or nodes above it).
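to make the "movable iterator" idea concrete, here is a minimal sketch of a cursor over a minidom tree. this is not the PyXML tree walker and not a full implementation of the w3c interface: ElementCursor and its method names are hypothetical, it only stops at element nodes, and it only supports three moves. what it does copy from the w3c description is that a failed move returns None and leaves the cursor where it was, and that the cursor never climbs above its root.

```python
from xml.dom import minidom


class ElementCursor:
    """A movable position in a dom tree that only stops at element nodes."""

    def __init__(self, root):
        self.root = root
        self.current = root

    def _skip_to_element(self, node):
        # step along siblings until an element (or the end) is reached
        while node is not None and node.nodeType != node.ELEMENT_NODE:
            node = node.nextSibling
        return node

    def _move(self, node):
        # only move when the target exists; otherwise stay put
        if node is not None:
            self.current = node
        return node

    def first_child(self):
        return self._move(self._skip_to_element(self.current.firstChild))

    def next_sibling(self):
        if self.current is self.root:
            return None
        return self._move(self._skip_to_element(self.current.nextSibling))

    def parent(self):
        if self.current is self.root:
            return None
        return self._move(self.current.parentNode)


if __name__ == "__main__":
    doc = minidom.parseString(
        "<filelist><file><name>a</name></file>"
        "<file><name>b</name></file></filelist>")
    cursor = ElementCursor(doc.documentElement)
    cursor.first_child()    # now at the first <file>
    cursor.next_sibling()   # now at the second <file>
    cursor.first_child()    # now at its <name>
    print(cursor.current.firstChild.data)   # prints b
```

because the cursor holds a reference to its current node, removing that node (or one of its ancestors) from the tree would leave it stranded, which is the reason for the caution above.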
there are also Visitor, WalkerInterface and PreOrderWalker classes in site-packages/_xmlplus/dom/ext/Visitor.py that look like they might recurse over the tree for you.
that's just from a few minutes' poking around - there's a lot in there and it's worth reading through interesting-looking source files.
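i haven't checked how the classes in Visitor.py actually work, so the following is only a sketch of the general pre-order visitor pattern, written against xml.dom.minidom (python 3). preorder, walk and NameCollector are hypothetical names of mine, not the PyXML api. the walker visits each node before its children and hands it to a visitor object.

```python
from xml.dom import minidom


def preorder(node):
    """Yield node, then each of its descendants, in document order."""
    yield node
    for child in node.childNodes:
        yield from preorder(child)


class NameCollector:
    """Visitor that records the text inside every <name> element."""

    def __init__(self):
        self.names = []

    def visit(self, node):
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "name":
            self.names.append("".join(
                child.data for child in node.childNodes
                if child.nodeType == child.TEXT_NODE))


def walk(root, visitor):
    """Apply visitor.visit to every node under root, parents first."""
    for node in preorder(root):
        visitor.visit(node)
    return visitor


if __name__ == "__main__":
    doc = minidom.parseString(
        "<filelist><file><name>file_1</name></file>"
        "<file><name>file_2</name></file></filelist>")
    print(walk(doc, NameCollector()).names)   # ['file_1', 'file_2']
```

the visitor never modifies the tree here; if it did, iterating over childNodes while removing nodes would need the same care as the tree walker.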