Java Language Parsing a document using the StAX API


Consider the following document:

<?xml version='1.0' encoding='UTF-8' ?>
<library>
   <book id='1'>Effective Java</book>
   <book id='2'>Java Concurrency In Practice</book>
   <notABook id='3'>This is not a book element</notABook>
</library>

One can use the following code to parse it and build a map of book titles by book id.

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {

    public static void main(String[] args) throws Exception {
        String xmlDocument = "<?xml version='1.0' encoding='UTF-8' ?>"
                + "<library>"
                    + "<book id='1'>Effective Java</book>"
                    + "<book id='2'>Java Concurrency In Practice</book>"
                    + "<notABook id='3'>This is not a book element</notABook>"
                + "</library>";

        XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
        // Various flavors are possible, e.g. from an InputStream, a Source, ...
        XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(new StringReader(xmlDocument));

        Map<Integer, String> bookTitlesById = new HashMap<>();

        // We go through each event using a loop; next() advances the parser
        // and returns the type of the event it is now positioned on
        while (xmlStreamReader.hasNext()) {
            switch (xmlStreamReader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    System.out.println("Found start of element: " + xmlStreamReader.getLocalName());
                    // Check if we are at the start of a <book> element
                    if ("book".equals(xmlStreamReader.getLocalName())) {
                        int bookId = Integer.parseInt(xmlStreamReader.getAttributeValue("", "id"));
                        // Reads the text and leaves the parser on the matching END_ELEMENT
                        String bookTitle = xmlStreamReader.getElementText();
                        bookTitlesById.put(bookId, bookTitle);
                    }
                    break;
                // A bunch of other things are possible: comments, processing instructions, whitespace...
                default:
                    break;
            }
        }
        xmlStreamReader.close();

        System.out.println(bookTitlesById);
    }
}


This outputs:

Found start of element: library
Found start of element: book
Found start of element: book
Found start of element: notABook
{1=Effective Java, 2=Java Concurrency In Practice}

In this sample, one must be careful about a few things:

  1. The call to xmlStreamReader.getAttributeValue works because we first checked that the parser is in the START_ELEMENT state. In every other state (except ATTRIBUTE), the parser is required to throw an IllegalStateException, because attributes can only appear at the start of an element.

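  2. The same goes for xmlStreamReader.getElementText(): it works because we are at a START_ELEMENT and we know that, in this document, the <book> element contains only text. If a child element were encountered, the method would throw an XMLStreamException. On success, the parser is left on the matching END_ELEMENT.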
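The two preconditions above can be checked with a small experiment. This is only a sketch: the class name StaxPreconditionsSketch and its helper methods are made up for illustration, and the inline XML snippets are not part of the document above.

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxPreconditionsSketch {

    // Hypothetical helper: creates a reader and advances it to the first START_ELEMENT.
    static XMLStreamReader readerAtRoot(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.next() != XMLStreamConstants.START_ELEMENT) {
            // skip anything that precedes the root element
        }
        return reader;
    }

    // Returns true if getAttributeValue is rejected once the parser has left START_ELEMENT.
    static boolean attributeAccessFailsOnText() throws XMLStreamException {
        XMLStreamReader reader = readerAtRoot("<book id='1'>Effective Java</book>");
        reader.next(); // move from START_ELEMENT to the text content
        try {
            reader.getAttributeValue(null, "id"); // null: do not check the namespace
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // Returns true if getElementText is rejected when the element has a child element.
    static boolean elementTextFailsOnChildElement() throws XMLStreamException {
        XMLStreamReader reader = readerAtRoot("<book id='1'>Effective <b>Java</b></book>");
        try {
            reader.getElementText();
            return false;
        } catch (XMLStreamException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println("getAttributeValue rejected on text: " + attributeAccessFailsOnText());
        System.out.println("getElementText rejected on child element: " + elementTextFailsOnChildElement());
    }
}
```

In real code it is usually simpler to check the event type before calling these methods than to catch the exceptions afterwards.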

For parsing more complex documents (deeper, nested elements, ...), it is good practice to delegate the parser to sub-methods or other objects, e.g. have a BookParser class or method deal with every event from the START_ELEMENT to the END_ELEMENT of the <book> element.
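A minimal sketch of that delegation could look as follows. It assumes a richer, hypothetical document in which each <book> has <title> and <author> child elements; the Book class, the method names and the sample values are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DelegationSketch {

    // Hypothetical value class, for illustration only.
    static final class Book {
        final int id;
        String title;
        final List<String> authors = new ArrayList<>();

        Book(int id) {
            this.id = id;
        }
    }

    // Precondition: the reader is on the START_ELEMENT of a <book>.
    // Postcondition: the reader is on the matching END_ELEMENT.
    // Assumes <book> elements are not nested inside each other.
    static Book parseBook(XMLStreamReader reader) throws XMLStreamException {
        Book book = new Book(Integer.parseInt(reader.getAttributeValue(null, "id")));
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    book.title = reader.getElementText();
                } else if ("author".equals(name)) {
                    book.authors.add(reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "book".equals(reader.getLocalName())) {
                return book;
            }
        }
        throw new XMLStreamException("Document ended inside a <book> element");
    }

    // The top-level loop only locates <book> elements and hands the reader over.
    static List<Book> parseLibrary(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        List<Book> books = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "book".equals(reader.getLocalName())) {
                    books.add(parseBook(reader));
                }
            }
        } finally {
            reader.close();
        }
        return books;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Placeholder author names, not real data.
        String xml = "<library>"
                + "<book id='1'><title>Effective Java</title><author>Author A</author></book>"
                + "<book id='2'><title>Java Concurrency In Practice</title>"
                + "<author>Author B</author><author>Author C</author></book>"
                + "</library>";
        for (Book book : parseLibrary(xml)) {
            System.out.println(book.id + ": " + book.title + " " + book.authors);
        }
    }
}
```

Stating the reader's position before and after each sub-parser, as in the comments on parseBook, makes it easier to compose such methods without skipping or re-reading events.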

One can also use a Stack object to keep track of important data while moving up and down the tree.
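As a rough illustration, the sketch below keeps the names of the currently open elements on a stack, so that a <title> directly under <book> can be told apart from a <title> elsewhere. It uses an ArrayDeque as the stack rather than java.util.Stack (the Deque interface is what the JDK documentation recommends for stack use); the class name, method name and sample document are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementStackSketch {

    // Collects the text of every <title> whose direct parent is a <book>.
    static List<String> bookTitles(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        Deque<String> openElements = new ArrayDeque<>(); // top of stack = current parent
        List<String> titles = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String parent = openElements.peek(); // null at the root element
                        if ("title".equals(reader.getLocalName()) && "book".equals(parent)) {
                            // getElementText leaves the reader on the END_ELEMENT of <title>,
                            // so the loop never sees that event and nothing is pushed here
                            titles.add(reader.getElementText());
                        } else {
                            openElements.push(reader.getLocalName());
                        }
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT:
                        openElements.pop();
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<library>"
                + "<title>Some library</title>"
                + "<book id='1'><title>Effective Java</title></book>"
                + "</library>";
        System.out.println(bookTitles(xml));
    }
}
```

The same stack could hold partially built objects instead of element names, if the data of a parent element is needed while its children are being parsed.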