Java Language XML Parsing using the JAXP APIs


XML Parsing is the interpretation of XML documents in order to manipulate their content using sensible constructs, be they "nodes", "attributes", "documents", "namespaces", or events related to these constructs.

Java has a native API for XML document handling called JAXP (Java API for XML Processing). JAXP and a reference implementation have been bundled with every Java release since Java 1.4 (JAXP v1.1), and the API has evolved since then. Java 8 shipped with JAXP version 1.6.

The API provides three ways of interacting with XML documents:

  • The DOM interface (Document Object Model)
  • The SAX interface (Simple API for XML)
  • The StAX interface (Streaming API for XML)

Principles of the DOM interface

The DOM interface aims to provide a W3C DOM-compliant way of interpreting XML. Successive versions of JAXP have supported successive levels of the DOM specification (up to Level 3).

Under the Document Object Model interface, an XML document is represented as a tree, starting with the "Document Element". The base type of the API is the Node type, which allows navigation from a Node to its parent, its children, or its siblings. Not all Nodes can have children, though: Text nodes, for example, are leaves of the tree and never have children. XML tags are represented as Elements, which notably extend Node with attribute-related methods.

The DOM interface is very useful because it allows "one line" parsing of XML documents into trees, easy modification of the constructed tree (adding, removing, or copying nodes), and finally serialization of the modified tree (back to disk). This comes at a price, though: the whole tree resides in memory, so DOM trees are not always practical for huge XML documents. Furthermore, building the tree is not always the fastest way of dealing with XML content, especially if one is not interested in all parts of the XML document.
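
The parse, modify, and serialize cycle described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the class name DomSketch, its helper methods, and the library/book element names are assumptions made up for the example, and serialization goes through the identity Transformer from javax.xml.transform, which is one common way to write a DOM tree back out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; the names are illustrative only.
class DomSketch {

    /** Parses an XML string into an in-memory DOM tree. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The "one line" parse: the whole document is read into a tree.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Appends a <book title="..."/> element to the document element. */
    static void addBook(Document document, String title) {
        Element book = document.createElement("book");
        book.setAttribute("title", title);
        document.getDocumentElement().appendChild(book);
    }

    /** Serializes the tree back to text using an identity transformation. */
    static String serialize(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document document = parse("<library><book title=\"SAX\"/></library>");
        addBook(document, "DOM");
        System.out.println(serialize(document));
    }
}
```

Note that the entire document is held in memory between parse and serialize, which is exactly the cost discussed above.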

Principles of the SAX interface

The SAX API is an event-oriented API for dealing with XML documents. Under this model, the components of an XML document are interpreted as events (e.g. "a tag has been opened", "a tag has been closed", "a text node has been encountered", "a comment has been encountered").

The SAX API uses a "push parsing" approach, where a SAX Parser is responsible for interpreting the XML document and invokes methods on a delegate (a ContentHandler) to deal with whatever event is found in the XML document. Usually, one never writes a parser; instead, one provides a handler that gathers all the needed information from the XML document.

The SAX interface overcomes the DOM interface's limitations by keeping only the minimum necessary data at the parser level (e.g. namespace contexts, validation state). As a result, the only information held in memory is whatever the ContentHandler keeps, and you, the developer, are responsible for that. The tradeoff is that there is no way of "going back in time" (or back in the XML document) with such an approach: while DOM allows a Node to go back to its parent, there is no such possibility in SAX.
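
As a rough sketch of the push model, the handler below keeps only the text of title elements and ignores everything else. The class names SaxSketch and TitleHandler and the title element name are assumptions chosen for the example; DefaultHandler is the standard convenience base class that implements ContentHandler with empty methods. The sketch assumes title elements are not nested inside one another.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the names are illustrative only.
class SaxSketch {

    /** Handler that remembers only the text content of <title> elements. */
    static final class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a <title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so accumulate.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName) && current != null) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    /** Runs the parser; the parser drives the handler through callbacks. */
    static List<String> titles(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TitleHandler handler = new TitleHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>DOM</title></book>"
                + "<book><title>SAX</title></book></library>";
        System.out.println(titles(xml));
    }
}
```

The handler has to track its own state (the current field) because, once an event has been delivered, the parser offers no way to look back at it.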

Principles of the StAX interface

The StAX API takes a similar, event-driven approach to processing XML as the SAX API. The one significant difference is that StAX is a pull parser, whereas SAX is a push parser. In SAX, the parser is in control and uses callbacks on the ContentHandler. In StAX, you call the parser, and you control when (and whether) to obtain the next XML "event".

The API starts with XMLStreamReader (or XMLEventReader), the gateways through which the developer asks for the next event in an iterator-style way: next() on XMLStreamReader, or nextEvent() on XMLEventReader.
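
A minimal sketch of the pull model with XMLStreamReader follows. The class name StaxSketch, the method firstTitle, and the title element name are assumptions for illustration. Because the calling code drives the loop, it can simply stop asking for events once it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; the names are illustrative only.
class StaxSketch {

    /** Returns the text of the first <title> element, or null if there is none. */
    static String firstTitle(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                // The caller pulls events one at a time, iterator-style.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // Reads the text-only content and moves to the matching end tag.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>DOM</title></book>"
                + "<book><title>SAX</title></book></library>";
        System.out.println(firstTitle(xml));
    }
}
```

Returning from inside the loop is the point of the design: with SAX the parser would keep pushing events at the handler, while here the rest of the document is never requested.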