Truth, Computing and Fail

Parsing XML in Python

anomit | November 16, 2008

I started off with BeautifulSoup for a project that needs to construct database queries out of XML. I was almost done with this component when I discovered, much to my chagrin, that BeautifulSoup does not preserve the case of tag names. For example, an element with uppercase letters in its tag name, wrapped around some text, would yield the name of the tag as ‘foobar’, all in lowercase. I had to find some way to replace the code base within a short time.
I quickly went through some of my starred items in Google Reader. One of the interesting finds was an article on using lxml from the IBM developerWorks pages. One of the examples on that page seemed very similar to using a SAX parser. I wasn’t keen at that moment to learn a third-party library from scratch, so I decided to go for the xml.sax package that already ships with Python. A tutorial and a couple of pages from the book ‘Python and XML’ later, I managed to get it right.

Basically, a SAX parser is event driven: you define methods, much like event handlers in Java, that act upon those events. The events include encountering the start or end of an XML element, a run of character data, and so on. The Java implementation of SAX defines interfaces for the handlers. SAX doesn’t have a formal specification as yet, and since Python doesn’t support interfaces, the Python binding instead provides four base classes in xml.sax.handler: ContentHandler, DTDHandler, EntityResolver and ErrorHandler. You can read about them in the Python docs. From the docs:

Handler implementations should inherit from the base classes provided in the module xml.sax.handler, so that all methods get default implementations.
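A rough sketch of what those default implementations buy you: a handler can inherit from ContentHandler (one of the base classes in xml.sax.handler) and override only the one method it cares about, while every other event falls through to the inherited no-op. The class name, the helper function and the element names in the sample document are made up for illustration. As a side effect, it also shows that xml.sax reports tag names with their case intact.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TagNameCollector(ContentHandler):
    """Hypothetical handler that overrides only startElement."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        # Record the element name exactly as the parser reports it.
        self.names.append(name)

    # endElement and characters are not defined here; the inherited
    # defaults do nothing, so those events are silently ignored.


def collect_tag_names(xml_bytes):
    """Parse an in-memory document and return the tag names in order."""
    handler = TagNameCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.names


# Made-up sample document with mixed-case tag names.
names = collect_tag_names(b"<Query><FooBar>some text</FooBar></Query>")
```

Here names ends up holding the tag names in document order, with their original capitalisation.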

The one we will be using the most is the class ContentHandler. The handler for our own XML format will inherit from it, and we override the methods defined in ContentHandler to suit our own needs. Some of the important methods are:

startElement(name, attrs): This is called by the parser when it encounters the start of an element. name holds the name of the element and attrs holds its attributes.

endElement(name): Called by the parser when it encounters the end of an element.

characters(content): Called by the parser with chunks of character data as they are found. The character data of an element may arrive in a single chunk or be split into multiple chunks. It’s advisable to use flags and accumulate the chunks rather than rely on everything arriving in one call.
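To make the chunking point concrete, here is a minimal sketch of a handler that uses a flag plus a buffer: startElement sets the flag, characters (the method name used by xml.sax) appends whatever chunk arrives, and endElement joins the pieces. The class, the helper and the sample element names are hypothetical, and the sketch assumes the target element is never nested inside itself.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers the text of one kind of element."""

    def __init__(self, target):
        ContentHandler.__init__(self)
        self.target = target   # element name whose text we want
        self.inside = False    # flag: are we inside a target element?
        self.chunks = []       # pieces of text seen so far
        self.values = []       # completed text, one entry per element

    def startElement(self, name, attrs):
        if name == self.target:
            self.inside = True
            self.chunks = []

    def characters(self, content):
        # May be called several times for one element, so only append.
        if self.inside:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.target:
            # Only now is the text known to be complete.
            self.values.append("".join(self.chunks))
            self.inside = False


def collect_text(xml_bytes, target):
    """Return the text content of every `target` element, in order."""
    handler = TextCollector(target)
    xml.sax.parseString(xml_bytes, handler)
    return handler.values


# Made-up document; the entity reference is a typical place for a split.
doc = b"<rows><name>Tom &amp; Jerry</name><name>Spike</name></rows>"
values = collect_text(doc, "name")
```

However the parser decides to split the first name, joining the chunks in endElement gives back the full text.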

Since a SAX parser is a stream parser and doesn’t build an in-memory tree, you can’t backtrack in the tree and nodes don’t have the usual child-parent relationship found in DOM style parsers. Flags are used to track what element the parser is currently in and take appropriate actions. There is a caveat here. If you are working on a tree with a large depth, it can be really frustrating and painful to manage a lot of flags. The one I am working on can have a maximum depth of 5, but it would rarely exceed 4. So it wasn’t a big deal for me to handle that many flags.
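If the number of flags does get out of hand, one common alternative (not what I used here, just a sketch) is to replace the individual flags with a single stack of the element names that are currently open. The top of the stack is the current element, the length of the stack is the depth, and the same tag name can be told apart by its ancestors. The class, the helper and the element names below are invented for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PathTracker(ContentHandler):
    """Hypothetical handler that tracks position with a stack, not flags."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []   # names of the elements that are currently open
        self.paths = []   # slash-separated path of every element seen

    def startElement(self, name, attrs):
        self.stack.append(name)
        # The stack now spells out where we are in the document.
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        # Leaving an element: drop it from the stack.
        self.stack.pop()


def element_paths(xml_bytes):
    """Return the path of each element, in document order."""
    handler = PathTracker()
    xml.sax.parseString(xml_bytes, handler)
    return handler.paths


# Made-up document where <column> appears in two different contexts.
doc = b"<query><table><column/></table><where><column/></where></query>"
paths = element_paths(doc)
```

A handler built this way can branch on the current path (or just on the last two entries of the stack) instead of keeping one boolean per element.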

The basic outline of a program that makes use of SAX in Python looks like this:
#####################################################

from xml.sax.handler import ContentHandler
from xml.sax import make_parser

class CustomHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize flags and other data structures if needed

    def startElement(self, name, attrs):
        # set flags, get values of attributes etc.
        pass

    def endElement(self, name):
        # clear flags etc.
        pass

    def characters(self, content):
        # copy character data to any data structure if needed
        pass

handler = CustomHandler()
saxparser = make_parser()
saxparser.setContentHandler(handler)
# some_file_stream is a placeholder for a file name or an open file object
saxparser.parse(some_file_stream)

########################################################

PS: I am still stuck with Python 2.5.2. I’m trying to keep pace by reading the changes in 2.6. I am counting on Harsh to bail me out when the need arises :p

Categories
Coding
Tags
python, xml

4 responses

Harsh | November 19, 2008

2.6 is nothing to worry about, its the 3.0 thats gonna be SkyNet.

( Ain’t it xml.sax.handler by the way? handlers gives an ImportError. Typo :P )

Could you enable viewing of plain-text in your syntax-highlighting plugin? Removing line numbers is a pain.

anomit | November 19, 2008

Fixed the typo, removed the line numbers :)

Skynet, lol!

Harsh | December 27, 2008

Now I get it. How to use this, i.e. Thanks!

Wouldn’t have learned it if I didn’t have to parse some XML file for a quick wget job. ;)

anomit | December 28, 2008

Glad to know it helped you. :)

License

Creative Commons License
This work by Anomit Ghosh is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 India License.