Showing posts with label java. Show all posts

Sunday, March 2, 2008

SAX Parser tips

Recently I got a couple of interesting questions from friends who work with XML and use a SAX parser to parse the data, for performance and memory efficiency. A SAX parser can work efficiently even on 2 GB XML files!

Identifying self-closing tags:
In XML, <br/> and <br></br> are equivalent, so a SAX parser does not tell you whether an element was written as a self-closing tag. However, there is a workaround for it: use the Locator object!

For <br/>, startElement and endElement report the same location: getLineNumber() and getColumnNumber() return the same values in both.

For <br></br>, they differ: the column numbers will be different (or even the line numbers!).

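But using a Locator with a SAX parser might slightly decrease performance.
One more thing: not every SAX parser supports Locators, since this is an optional feature, and the positions a Locator reports are only approximate, so test this trick with the parser you actually use.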

More about Locators can be found at http://www.saxproject.org/apidoc/org/xml/sax/Locator.html
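
The Locator trick above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the class name SelfClosingDetector and the detect() helper are made up for illustration, and it assumes the parser hands over a Locator and reports the same position for both events of a self-closing tag, as described above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records, per element name, whether the start and end
// events were reported at the same Locator position.
class SelfClosingDetector extends DefaultHandler {
    private Locator locator;      // supplied by the parser; stays null if unsupported
    private int startLine = -1;   // position seen at the most recent startElement
    private int startColumn = -1;

    // element name -> true if it looked like <name/>
    final Map<String, Boolean> selfClosing = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            startLine = locator.getLineNumber();
            startColumn = locator.getColumnNumber();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Same position as the latest startElement means nothing was read in
        // between, which is what we expect for <name/>. A value of -1 means
        // "position not available", so we never report a match in that case.
        boolean samePosition = locator != null
                && startLine > 0
                && locator.getLineNumber() == startLine
                && locator.getColumnNumber() == startColumn;
        selfClosing.put(qName, samePosition);
    }

    static Map<String, Boolean> detect(String xml) throws Exception {
        SelfClosingDetector handler = new SelfClosingDetector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.selfClosing;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detect("<root><br/><p></p></root>"));
    }
}
```

Only the position of the most recent startElement is kept, which is enough here because a self-closing tag fires its two events back to back. If the parser does not supply a Locator, every element is simply reported as not self-closing.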


Handling default attributes

Problem:
Input file : <xhtml:td>VI</xhtml:td><xhtml:td align="left">Benzyl</xhtml:td>

Output file :
<xhtml:td rowspan="1" colspan="1">VI</xhtml:td>
<xhtml:td align="left" rowspan="1" colspan="1">Benzyl</xhtml:td>

The output has "rowspan" and "colspan" attributes included automatically, even though they are not present in the input.

The DTD declaration for xhtml:td is as follows:
<!ATTLIST %td.qname;
%attrs;
abbr %Text; #IMPLIED
axis CDATA #IMPLIED
headers IDREFS #IMPLIED
scope %Scope; #IMPLIED
xhtml:rowspan %Number; "1"
xhtml:colspan %Number; "1"
%cellhalign;
%cellvalign;
>

These attributes appear because they have a default value in the DTD.

The DTD says that the default value of xhtml:rowspan is 1, so unless you specify some other value, rowspan will be 1.

Even if you don't write the attribute in the document, the SAX parser automatically takes its value from the DTD (a 'special' feature called DTD defaulting).

You can only detect this with a SAX2 parser (not with SAX parser version 1.x). I think most of the SAX parsers available today (like the one that comes with JDK 1.5) are SAX2.

In your startElement method, such a parser typically passes you an object that implements Attributes2 and not just Attributes; Attributes2 is a sub-interface of Attributes.

The Attributes2 interface has a method isSpecified() which returns true unless the attribute value was provided by DTD defaulting.

So, keep this check in the startElement method:



import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Attributes2;

public void startElement(String uri, String localName,
        String qName, Attributes attributes) throws SAXException
{
    if (attributes instanceof Attributes2) {
        Attributes2 att = (Attributes2) attributes;
        for (int i = 0; i < att.getLength(); i++) {
            if (att.isSpecified(i)) { // present in the XML file
                System.out.println(att.getQName(i) + "=\"" + att.getValue(i) + "\"");
            } else {
                // not present in the XML file; the value came from the DTD
            }
        }
    } // otherwise we have no choice but to output all attributes
}
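
To try the check end to end, here is a small self-contained sketch. It is not the original project's code: the class SpecifiedAttributesDemo and the sample document are made up for illustration, and the sample uses a tiny internal DTD with plain (unprefixed) td, rowspan and colspan names as a stand-in for the XHTML DTD quoted earlier.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: sorts attributes into "written in the document" and
// "filled in by DTD defaulting".
class SpecifiedAttributesDemo extends DefaultHandler {
    // Made-up sample: the internal DTD defaults rowspan and colspan to "1".
    static final String SAMPLE =
          "<!DOCTYPE tr [\n"
        + "<!ELEMENT tr (td*)>\n"
        + "<!ELEMENT td (#PCDATA)>\n"
        + "<!ATTLIST td rowspan CDATA \"1\" colspan CDATA \"1\" align CDATA #IMPLIED>\n"
        + "]>\n"
        + "<tr><td>VI</td><td align=\"left\">Benzyl</td></tr>";

    final List<String> specified = new ArrayList<>();
    final List<String> defaulted = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String entry = qName + "@" + attributes.getQName(i) + "=" + attributes.getValue(i);
            // Without Attributes2 we cannot tell the difference, so we treat
            // every attribute as if it came from the document.
            boolean fromDocument = !(attributes instanceof Attributes2)
                    || ((Attributes2) attributes).isSpecified(i);
            (fromDocument ? specified : defaulted).add(entry);
        }
    }

    static SpecifiedAttributesDemo run(String xml) throws Exception {
        SpecifiedAttributesDemo handler = new SpecifiedAttributesDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        SpecifiedAttributesDemo result = run(SAMPLE);
        System.out.println("specified: " + result.specified);
        System.out.println("defaulted: " + result.defaulted);
    }
}
```

When writing the output file, you would then emit only the entries in the specified list and drop the defaulted ones.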



There is another, better way to check whether the SAX parser supplies Attributes2 objects: query the feature flag http://xml.org/sax/features/use-attributes2 with XMLReader.getFeature() (it is a parser feature, not a system property).
More details at http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
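
Querying that feature flag could look roughly like this. It is a minimal sketch: the class name Attributes2FeatureCheck and the method usesAttributes2() are made up for illustration, and a parser that does not recognize the flag is simply treated as not supplying Attributes2.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper around the standard SAX2 feature flag.
class Attributes2FeatureCheck {
    static final String USE_ATTRIBUTES2 = "http://xml.org/sax/features/use-attributes2";

    // Returns true if the reader says its Attributes objects implement Attributes2.
    static boolean usesAttributes2(XMLReader reader) {
        try {
            return reader.getFeature(USE_ATTRIBUTES2);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The parser does not know the flag, so assume plain Attributes.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println("use-attributes2: " + usesAttributes2(reader));
    }
}
```

Doing this check once, before parsing starts, avoids an instanceof test on every startElement call.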

Saturday, April 21, 2007

Benchmarks of different computer programming languages

Here is a benchmark of different programming languages:
http://shootout.alioth.debian.org/gp4/
http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=java

For example, here is a comparison between C (using gcc) and Java (Sun JDK; just note that the IBM JDK sometimes runs faster than the Sun JDK):
http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=gcc&lang2=java
A C application starts 91 times faster than Java and takes 22 times less memory in the recursion test.
Of course everyone agrees that C is much faster than Java. But what is more important is the analysis of the results.
For example, between C and Java, apart from Java taking more memory and starting up very slowly, C is only about 2 times faster than Java.
Hence, where memory is plentiful and we don't start the application too often, Java is the better choice, taking into account its garbage collection and vast library. So Java is a good fit for application servers (J2EE servers) but not for a desktop application like Notepad. Generally you restart a server once a month or so (hence C's 91 times faster startup doesn't really matter here), and servers will have 5 GB of RAM, so whether the application takes 100 MB or 500 MB of memory does not really matter. Cool.

Now let's compare Java and Python:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=java&lang2=python
Even though Python is very slow compared to Java (in some cases 90 times slower, in recursion), it is much better on memory (in almost every case it takes less memory, sometimes 10 times less) and it starts 7 times faster than Java. Now you get the idea: Python is better for desktop applications because it takes less memory, and for desktop applications memory is more important.

Imagine a Java application takes around 20 MB; then you can run about 15 Java applications on a desktop with 512 MB of RAM (taking into account that the OS will also take some memory). Python takes 10 times less memory, so you can easily run 150 Python applications simultaneously.

So we can safely conclude that Python is very hot for desktop applications. Of course C is better than Python, but consider the huge standard library Python provides, even bigger and cleaner than Java's ;-) and Python's memory usage is comparable with C's. Python is better for desktop applications than C or Java, because there memory is more important than speed. Imagine a Notepad written in Java that takes 20 MB of memory and 2 minutes to start, and then runs as fast as a C Notepad, whereas a Python one may take around 1 or 2 MB and around 10 seconds to load, but runs slower than the C Notepad. But is the user really able to tell apart a program that takes 2 ms from one that takes 200 ms? Even though the C program is 100 times faster, it doesn't matter in this case; it matters in an RDBMS or a search engine, right?
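
The back-of-envelope memory estimate above can be written out as a tiny calculation. This is only an illustration: the class MemoryBudget is made up, and the 212 MB reserved for the OS is an assumed figure, picked only so that the result matches the 15 Java applications mentioned in the post.

```java
// Hypothetical helper for the "how many applications fit in RAM" estimate.
class MemoryBudget {
    // Number of applications of perAppMb each that fit once the OS share is set aside.
    static int appsThatFit(int totalMb, int reservedForOsMb, double perAppMb) {
        return (int) Math.floor((totalMb - reservedForOsMb) / perAppMb);
    }

    public static void main(String[] args) {
        int totalMb = 512;        // desktop RAM from the post
        int reservedForOsMb = 212; // assumption, not a figure from the post
        System.out.println("Java apps (20 MB each):  " + appsThatFit(totalMb, reservedForOsMb, 20));
        System.out.println("Python apps (2 MB each): " + appsThatFit(totalMb, reservedForOsMb, 2));
    }
}
```

With these assumed numbers the 10 times smaller footprint translates directly into 10 times as many running applications.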

http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=python&lang2=perl
Comparing Perl and Python, we see that (except in one case) both are equally good. So now we should consider only the language, library, and portability advantages. For some fun we will compare these now.
Python is a very clean object-oriented language; Perl, on the other hand, supports both OO and structured programming. Python has a bigger standard library than Perl. But developing a small prototype application in Perl is far easier than in Python (in Python you tend to define classes and methods). So Perl is better as a scripting language or for CGI programming than for developing large-scale applications (because in large applications Perl looks cluttered and Python looks very organized).

So, which language is better? It depends on the type of application we are developing.

Regards
Suresh

Copyright (c) 2008 - Suresh