Sunday, March 2, 2008

SAX Parser tips

Recently I got a couple of interesting questions from friends who work with XML and use a SAX parser to parse the data, for performance and memory efficiency; a SAX parser can work efficiently even on 2 GB XML files!

Identifying self-closing tags:
In XML, <br/> and <br></br> are equivalent, so a SAX parser does not tell you whether an element was written as a self-closing tag or not. However, there is a workaround: use the Locator object!

For <br/>, the locator reports the same position in both startElement and endElement: getLineNumber() and getColumnNumber() return the same values.

For <br></br>, the positions will differ: the column numbers will be different (or even the line numbers!).

But using a Locator with a SAX parser might slightly decrease performance.
One more thing: not every SAX parser supports Locators, as this is an optional feature.

More about Locators can be found at http://www.saxproject.org/apidoc/org/xml/sax/Locator.html
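
Here is a minimal sketch of the locator trick, assuming the parser that ships with the JDK (which does supply a Locator). The class name SelfClosingDetector, the scan() helper and the report strings are made up for illustration; only the Locator and DefaultHandler calls come from the SAX API.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; the name and the report format are illustrative only.
class SelfClosingDetector extends DefaultHandler {
    private Locator locator; // stays null if the parser supplies no locator
    private final Deque<int[]> startPositions = new ArrayDeque<>();
    final List<String> report = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Remember the position reported for the start tag.
        startPositions.push(currentPosition());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        int[] start = startPositions.pop();
        int[] end = currentPosition();
        // Without usable positions we cannot tell, so we fall back to "start/end pair".
        boolean known = start[0] > 0 && start[1] > 0;
        boolean selfClosing = known && start[0] == end[0] && start[1] == end[1];
        report.add(qName + (selfClosing ? ": self-closing" : ": start/end pair"));
    }

    private int[] currentPosition() {
        if (locator == null) {
            return new int[] { -1, -1 };
        }
        return new int[] { locator.getLineNumber(), locator.getColumnNumber() };
    }

    // Parses the given XML text and returns one report line per element, in endElement order.
    static List<String> scan(String xml) throws Exception {
        SelfClosingDetector handler = new SelfClosingDetector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.report;
    }
}
```

Calling SelfClosingDetector.scan("<p><br/><br></br></p>") should report the first br as self-closing and the second one as a start/end pair. Other parsers may report positions differently, so treat the result as a hint.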


Handling default attributes

Problem:
Input file : <xhtml:td>VI</xhtml:td><xhtml:td align="left">Benzyl</xhtml:td>

Output file :
<xhtml:td rowspan="1" colspan="1">VI</xhtml:td>
<xhtml:td align="left" rowspan="1" colspan="1">Benzyl</xhtml:td>

The output automatically includes "rowspan" and "colspan", but they are not present in the input.

The DTD declaration for xhtml:td is as below:
<!ATTLIST %td.qname;
%attrs;
abbr %Text; #IMPLIED
axis CDATA #IMPLIED
headers IDREFS #IMPLIED
scope %Scope; #IMPLIED
xhtml:rowspan %Number; "1"
xhtml:colspan %Number; "1"
%cellhalign;
%cellvalign;
>

These attributes appear because they have a default value in the DTD.

The DTD states that the default value of xhtml:rowspan is 1, so unless you specify some other value, rowspan will be 1.

Even if you don't write that attribute in the document, the SAX parser automatically takes the value from the DTD (a 'special' feature of SAX parsers called DTD defaulting).

You can only handle this with a SAX2 parser (not with SAX version 1.x). I think most SAX parsers available today (like the one that comes with JDK 1.5) are SAX2.

In your startElement method, you may get an Attributes2 object instead of a plain Attributes; Attributes2 is a subinterface of Attributes.

The Attributes2 interface has the method isSpecified(), which returns true unless the attribute value was provided by DTD defaulting.

So, keep this check in your startElement method:



// needs: import org.xml.sax.Attributes; import org.xml.sax.SAXException;
//        import org.xml.sax.ext.Attributes2;
public void startElement(String uri, String localName,
        String qName, Attributes attributes) throws SAXException
{
    if (attributes instanceof Attributes2) {
        Attributes2 att = (Attributes2) attributes;
        for (int i = 0; i < att.getLength(); i++) {
            if (att.isSpecified(i)) { // present in the XML file
                System.out.println(att.getQName(i) + "=\"" + att.getValue(i) + "\"");
            } else {
                // not present in the XML file; came from the DTD
            }
        }
    } // if not, we have no choice but to output all attributes
}
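
To try the check end to end, here is a small self-contained sketch. It assumes the JDK's built-in parser and uses a made-up one-element document with an internal DTD subset (not the XHTML DTD above); the class name DefaultedAttributesDemo and the SAMPLE string are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: sorts attribute names into "written in the document"
// and "filled in by DTD defaulting".
class DefaultedAttributesDemo extends DefaultHandler {
    // Made-up sample: the DTD gives rowspan and colspan a default of "1",
    // while the document itself only writes colspan.
    static final String SAMPLE =
            "<!DOCTYPE td [<!ATTLIST td rowspan CDATA \"1\" colspan CDATA \"1\">]>"
            + "<td colspan=\"2\">VI</td>";

    final List<String> specified = new ArrayList<>();
    final List<String> defaulted = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            // Without Attributes2 we cannot tell, so every attribute counts as specified.
            boolean fromDocument = !(attributes instanceof Attributes2)
                    || ((Attributes2) attributes).isSpecified(i);
            (fromDocument ? specified : defaulted).add(attributes.getQName(i));
        }
    }

    // Parses the given XML text and returns the handler holding both lists.
    static DefaultedAttributesDemo parse(String xml) throws Exception {
        DefaultedAttributesDemo handler = new DefaultedAttributesDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }
}
```

With DefaultedAttributesDemo.parse(DefaultedAttributesDemo.SAMPLE), colspan should land in the specified list and rowspan in the defaulted list, so a writer that skips the defaulted list reproduces the input as it was written.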



There is another, better way to check whether the SAX parser supplies Attributes2 or not: query the SAX feature flag http://xml.org/sax/features/use-attributes2 with XMLReader.getFeature() (it is a parser feature, not a system property).
More details at http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
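
A minimal sketch of that feature query, assuming you get the XMLReader from a JAXP SAXParser. The class name Attributes2Check and the helper method are hypothetical; getFeature() and the two exceptions are standard SAX API. A parser that does not know the flag throws, which the sketch treats as "no Attributes2".

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class; only getFeature() and the feature URI come from SAX.
class Attributes2Check {
    static final String FEATURE = "http://xml.org/sax/features/use-attributes2";

    // Returns true if the reader says its Attributes objects implement Attributes2.
    static boolean supportsAttributes2(XMLReader reader) {
        try {
            return reader.getFeature(FEATURE);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The parser does not know the flag or cannot report it right now.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println("Attributes2 supported: " + supportsAttributes2(reader));
    }
}
```

Doing this once before parsing saves the instanceof test on every startElement call, though keeping the instanceof check as a safety net costs very little.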


Copyright (c) 2008 - Suresh