Comparing SSAX and Expat XML parsers in features and performance

  1. Introduction to the benchmark
  2. Expat and SSAX
  3. Summary of the performance benchmark
  4. Why SSAX?
    1. SSAX has a better API
  5. Answers to comments
    1. Choice of XML trees as input files for the benchmark
    2. Is the benchmark fair to Expat?
    3. Is SSAX really referentially transparent?
    4. What about performance of Expat if it were to wrapped in Scheme?
  6. References

We present the results of a benchmark that compares the performance of C and Scheme XML parsers. The benchmark gives an insight in the performance of Scheme, compared to the best C code. We also discuss differences between SSAX and Expat in terms of features and the user interface.

This page is a compilation of two articles posted on a newsgroup comp.lang.scheme On Nov 1 and 5, 2001, extended with answers to common comments.


  

Introduction to the benchmark

The benchmark is "untagging" of an XML document realizing a full binary tree, such as

     <node><node><node><node><leaf>0</leaf><leaf>1</leaf></node>
                       <node><leaf>2</leaf><leaf>3</leaf></node></node>
                 <node><node><leaf>4</leaf><leaf>5</leaf></node>
                       <node><leaf>6</leaf><leaf>7</leaf></node></node></node>
           <node><node><node><leaf>8</leaf><leaf>9</leaf></node>
                       <node><leaf>0</leaf><leaf>1</leaf></node></node>
     	      <node><node><leaf>2</leaf><leaf>3</leaf></node>
                       <node><leaf>4</leaf><leaf>5</leaf></node></node>
     </node></node>
The content of leaf nodes is a single-digit string. In loftier terms, we compute a string-value of the document root.
  

Expat and SSAX

To make the comparison meaningful, we need to discuss input models of Expat and SSAX.

Expat is the fastest XML parser [Expat-tutorial]. It is written in C by James Clark. It is a SAX parser. An application that uses Expat must open an XML file or stream and read its blocks into memory. The application should pass these blocks to Expat, indicating the block size and if it is the last block of the document. An application can potentially load the whole document into memory and pass this single block to Expat. The parser is specifically optimized for such a scenario, because Expat uses shared substrings as much as possible. The first benchmark application, 'string-value.c', reads the whole document into memory, passes it to Expat and asks the parser to compute the string value. It takes 0.105 sec user time for string-value.c to handle an XML document that represents a full binary tree of depth 15 (32768 leaf nodes, total size 884,723 bytes). This is indeed fast.

The Expat input mode assumes that a calling application must know when the document ends. Otherwise, an application cannot read the document from a stream, in whole or in blocks. When the document to parse is read from a regular file, it is trivial to find out when we are finished reading. The OS will tell us. However, if we take a document from a (tcp) pipe, it may be impossible to tell offhand when to stop reading. Furthermore, if we unwittingly try to read one extra character, we become blocked and possibly deadlocked.

SSAX is also a SAX parser [SSAX]. It is written completely in Scheme. SSAX uses a different input model. Rather than relying on an application to feed data to it, SSAX reads characters from a given input port itself. SSAX reads ahead by no more than one character, and only when the parser is positive the character to read ahead must be available. SSAX does not need to be told when the document is ended. On the contrary, SSAX will tell you when it has finished parsing a root (or other) element. SSAX therefore is safe to be used on pipes, to process several documents in a file, or to handle selected parts of a document.

Therefore, to meaningfully compare SSAX with Expat, we need to modify the Expat application. string-value-by-one.c is such a modified benchmark. It loads an XML document into memory first. It then passes the content of that buffer one character at a time to Expat. This simulates the work of the SSAX parser. It takes 0.747 user seconds for string-value-by-one.c to handle the same XML document as above (full binary tree of depth 15).

A SSAX benchmark string-value-ssax.scm likewise loads an XML document first, opens the memory buffer as a string port and passes the port to SSAX. It takes 1.092 user seconds to handle the same document. string-value-ssax.scm was compiled by Bigloo 2.4a.

The complete code for the benchmark is part of the SSAX project at SourceForge [SSAX-benchmarks]


  

Summary of the performance benchmark

Let's summarize the performance.

string-value.c: XML-tree-depth-15 XML-tree-depth-16
user time, s 0.105 0.213
system time, s 0.016 0.022
page reclaims 241 483
 
string-value-by-one.c: XML-tree-depth-15 XML-tree-depth-16
user time, s 0.747 1.494
system time, s 0.014 0.012
page reclaims 25 49
 
string-value-ssax.scm: XML-tree-depth-15 XML-tree-depth-16
user time, s 1.092 2.170
system time, s 0.024 0.095
page reclaims 842 2353
 
File size, bytes 884,723 1,769,459

You can trust the precision of the timings: they were reported by getrusage() , using a microsecond virtual clock. All benchmarks were run on a FreeBSD 4.0-RELEASE system, Pentium III Xeon 500 MHz. The numbers above reflect activities that occur entirely in memory. There is no i/o whatsoever -- neither application i/o nor VM-related i/o. There were no page faults.

The most striking result is that a Scheme application is only 1.4 times slower than a comparable well-written C application. Only 1.4 times! Note Java applications are generally far slower than comparable C applications. Some XML performance benchmarks [Parser-benchmark] show that even the fastest Java XML parser (XP) is slower than Expat by an order of magnitude; Perl and Python parsers that are based on Expat [sic!] are slower than Expat by a factor of 20-25 (for large files). Thus the SSAX parser seems quite competitive in performance. Again, many thanks to Manuel Serrano for his excellent optimizing compiler, Bigloo. BTW, did I mention that SSAX is a pure functional parser? The whole parser and all of its handlers are completely referentially transparent.


  

Why SSAX?

Despite some similarities between SSAX and Expat (which came as a surprise to me myself), SSAX is not a Scheme clone of Expat. SSAX is intended to be a toolkit for various markup-related tasks. One example -- parsing of (potentially invalid) HTML -- has been demonstrated recently [HTML-parsing]. I was surprised myself at how easy it was to write an HTML parser/lexer using SSAX. The major difference between SSAX and Expat however is that SSAX has a better interface.


  

SSAX has a better API

The application interface of SSAX has several advantages over Expat API. SSAX minimizes the amount of application-specific state that has to be shared among user-supplied event handlers. In particular, SSAX makes the maintenance of an element stack unnecessary, which eliminates several classes of common bugs. SSAX is written in a pure-functional subset of Scheme. Therefore, the event handlers are referentially transparent, which makes them easier for a programmer to write and to reason about. The more expressive, reliable and easier to use application interface for the event-driven XML parsing is the outcome of implementing the parsing engine as an enhanced tree fold combinator, which fully captures the control pattern of the depth-first tree traversal.

A tutorial article about Expat [Expat-tutorial] explains well how Expat is supposed to be used. A user application most certainly has to have

a good stack mechanism in order to keep track of current context... The things you're likely to want to keep on a stack are the currently opened element and it's attributes. You push this information onto the stack in the start handler and you pop it off in the end handler.

Compare that description with the following SSAX application, which converts an XML document to a simplified (for the sake of clarity) SXML:

     (define (simple-XML->SXML port)
       (reverse
        ((SSAX:make-parser
          NEW-LEVEL-SEED
          (lambda (elem-gi attributes namespaces expected-content seed)
          '())
     
          FINISH-ELEMENT
          (lambda (elem-gi attributes namespaces parent-seed seed)
            (cons (cons elem-gi (reverse seed)) parent-seed))
     
          CHAR-DATA-HANDLER
          (lambda (string1 string2 seed)
            (if (string-null? string2) (cons string1 seed)
                (cons* string2 string1 seed)))
     
         )
         port '())))
As you see, simple-XML->SXML does not need any stack to keep track of the current context. There is no need to maintain the stack across several callbacks and check for underflows and other errors. The SSAX callbacks hardly do anything at all. The simpler the handlers are, the easier it is to write them and to reason about them.
  

Answers to comments


  

Choice of XML trees as input files for the benchmark

XML representation of trees has many deeply-nested elements. Processing such documents exercises the parser engine. In contrast, marked-up Shakespeare's plays (another popular choice for benchmark input files) are comprised mostly of character data. They do not exert the element parsing engine.


  

Is the benchmark fair to Expat?

The most striking result is that a Scheme application is only 1.4 times slower than a comparable well-written C application.
Yes, but only if you slow down the C application by a factor of 7 first.
SSAX could have read the whole document into memory and parse it from there, as HaXml does. As the benchmark shows, parsing from memory and avoiding any string copies increases the performance by a factor of 7. Yet I did not feel that solution satisfactory. I do parse XML documents from pipes; oftentimes the documents are generated on the fly so I do not know what the final size of the document will be. In that situation, I can not load the document into memory without "pre-parsing" of the input stream first. Furthermore, SSAX can handle documents that are simply too big to load into memory.
  

Is SSAX really referentially transparent?

It is emphasized in SSAX title comments and elsewhere that the input port is treated as a linear, read-once parameter. This technique is similar to that of Clean. I/O in Clean is referentially transparent.

Because the port is treated in SSAX as a read-once parameter,

     (let ((result (foo port))) ...)
is equivalent to
     (let-values (((result new-port) (foo* port))) ... )
and can be mechanically transformed into the latter if needed. Transformations of that kind cannot generally be performed on code that uses assignments indiscriminately (which is not referentially transparent, that is).

In short, SSAX is as referentially transparent as Clean is -- which has been proven to be.


  

What about performance of Expat if it were to wrapped in Scheme?

Just for the sake of argument, if one were to wrap Expat into Scheme primitive objects, and invoke Expat from an interpreter, how do you think it will perform?

If you invoke Expat from a Scheme interpreter, you should expect the the performance similar to that of Perl or Python wrappers around Expat [Parser-benchmark].

The wrappers start off well for small documents; but as the document size increases to around 3 MB, the wrapped Expat becomes 20-30 times slower than the original Expat. The reason is the cost of invoking of Perl/Python callbacks from the Expat code (and the related cost of translating strings and other data across the language barrier).


  

References

[Expat-tutorial] Clark Cooper. Using Expat. September 01, 1999.
<http://www.xml.com/pub/a/1999/09/expat/index.html>

[SSAX] SSAX and SXML at SourceForge.
<http://ssax.sourceforge.net/>

[SSAX-benchmarks] SSAX and Expat benchmarks.
<http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/ssax/SSAX/benchmarks/>

[Parser-benchmark] Clark Cooper. Summary of XML Parser Performance Testing. May 05, 1999.
<http://www.xml.com/pub/a/Benchmark/exec.html>

[HTML-parsing] A permissive HTML parser.
<http://sourceforge.net/forum/forum.php?forum_id=126763>

[Expat-vs-SSAX thread] Expat vs. SSAX, C vs. Scheme (Bigloo) Two articles posted on a newsgroup comp.lang.scheme on Thu, 1 Nov 2001 13:41:22 -0800 and Mon, 5 Nov 2001 13:40:16 -0800.


Last updated December 7, 2001

This site's top page is http://pobox.com/~oleg/ftp/

oleg@pobox.com or oleg@acm.org or oleg@computer.org
Your comments, problem reports, questions are very welcome!

Converted from SXML by SXML->HTML