Parsing XML in Python
anomit | November 16, 2008
I started off with BeautifulSoup for a project that needs to construct database queries out of XML. I was almost done with this component when I discovered, much to my chagrin, that BeautifulSoup does not preserve the case of tag names. For eg 
I quickly went through some of my starred items in Google Reader. One of the interesting finds was an article on using lxml from the IBM developerWorks pages. One of the examples on that page looked very similar to using a SAX parser. I wasn't keen at that moment on learning a third-party library from scratch, so I decided to go for the xml.sax package that ships with the standard library. A tutorial and a couple of pages from the book 'Python and XML' later, I managed to get it right.
Basically, a SAX parser is event driven: you define methods, much like event handlers in Java, that act upon those events. The events include encountering the start or end of an XML element, a run of character data, etc. The Java implementation of SAX defines interfaces for the handlers. SAX doesn't have a formal specification as yet, and since Python doesn't support interfaces, the Python version instead provides four classes in xml.sax.handler: ContentHandler, DTDHandler, EntityResolver and ErrorHandler. You can read about them in the Python docs. From the docs:
Handler implementations should inherit from the base classes provided in the module xml.sax.handler, so that all methods get default implementations.
The one we will be using the most is the class ContentHandler. The handler for our own XML format will inherit from it. We override the methods implemented in ContentHandler to customize them to our own needs. Some of the important ones are:
startElement(name, attrs): Called by the parser when it encounters the start of an element. name holds the name of the element and attrs holds its attributes.
endElement(name): Called by the parser when it encounters the end of an element.
characters(content): Called by the parser with chunks of character data as they are found. The character data of one text node may arrive in a single chunk or be split across multiple chunks, so it's advisable to use flags and accumulate the chunks rather than assume a single call.
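To see these callbacks in action, here is a minimal sketch that simply records every event the parser fires. The class name EventRecorder, the helper record_events and the sample XML are made up for illustration. Note that the method xml.sax actually calls for text is characters (plural), and that mixed-case tag names are reported exactly as written.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventRecorder(ContentHandler):
    """Hypothetical handler that logs each SAX event as a tuple."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        # attrs maps attribute names to values; copy it into a plain dict
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # May be called more than once for a single run of text
        self.events.append(("chars", content))


def record_events(xml_bytes):
    """Parse the given XML bytes and return the list of recorded events."""
    handler = EventRecorder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


# Made-up sample input, only for illustration
events = record_events(b'<Query table="users"><Field>name</Field></Query>')
for event in events:
    print(event)
```

The events come out in document order, which is all the information a SAX handler ever gets: there is no tree to walk afterwards.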
Since a SAX parser is a stream parser and doesn't build an in-memory tree, you can't backtrack in the tree and nodes don't have the usual child-parent relationship found in DOM style parsers. Flags are used to track what element the parser is currently in and take appropriate actions. There is a caveat here. If you are working on a tree with a large depth, it can be really frustrating and painful to manage a lot of flags. The one I am working on can have a maximum depth of 5, but it would rarely exceed 4. So it wasn't a big deal for me to handle that many flags.
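As a rough illustration of the flag approach, the sketch below collects the text of every Field element. The element names, the class FieldCollector and the helper collect_fields are assumptions invented for this example, not part of my actual project. The flag says whether the parser is currently inside a Field, and the text is joined only when the element ends, since it may arrive in pieces.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class FieldCollector(ContentHandler):
    """Hypothetical handler: collects the text of every <Field> element."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.in_field = False  # flag: are we inside <Field> right now?
        self.chunks = []       # pieces of text seen for the current <Field>
        self.fields = []       # completed field values

    def startElement(self, name, attrs):
        if name == "Field":
            self.in_field = True
            self.chunks = []

    def characters(self, content):
        # Keep text only while the flag is set; it may arrive in pieces
        if self.in_field:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "Field":
            # Join the pieces into one value and clear the flag
            self.fields.append("".join(self.chunks))
            self.in_field = False


def collect_fields(xml_bytes):
    """Parse the given XML bytes and return the text of each <Field>."""
    handler = FieldCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.fields


# Made-up sample input; the entity makes split chunks more likely
sample = (b"<Query><Table>users</Table>"
          b"<Field>name</Field><Field>e&amp;mail</Field></Query>")
print(collect_fields(sample))
```

For deeper trees, one possible alternative to a pile of separate flags is a single list used as a stack of the currently open element names, pushed in startElement and popped in endElement.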
The basic outline of a program that makes use of SAX in Python looks like this:
#####################################################
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class CustomHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize flags and other data structures if needed

    def startElement(self, name, attrs):
        # set flags, get value of attributes etc
        pass

    def endElement(self, name):
        # clear flags etc
        pass

    def characters(self, content):
        # copy character data to any data structure if needed
        pass

ch = CustomHandler()
saxparser = make_parser()
saxparser.setContentHandler(ch)
# some_file_stream is a placeholder for a filename or an open file object
saxparser.parse(some_file_stream)
########################################################
PS: I am still stuck with Python 2.5.2. I'm trying to keep pace by reading the changes in 2.6. I am counting on Harsh to bail me out when the need arises :p

Harsh | November 19, 2008
2.6 is nothing to worry about, it's the 3.0 that's gonna be SkyNet.
(Ain't it xml.sax.handler, by the way? handlers gives an ImportError. Typo.)
Could you enable viewing of plain-text in your syntax-highlighting plugin? Removing line numbers is a pain.
anomit | November 19, 2008
Fixed the typo, removed the line numbers :) Skynet, lol!
Harsh | December 27, 2008
Now I get it. How to use this, i.e. Thanks!
Wouldn't have learned it if I didn't have to parse some XML file for a quick wget job.
anomit | December 28, 2008
Glad to know it helped you. :)