Saturday, July 9. 2016
wbxml-stream: Version 0.3.0
After releasing version 0.2.0 of the wbxml-stream library, I decided to keep working on the project in order to complete the StAX (Streaming API for XML) implementation for the WBXML format. StAX is quite complicated and not very clear in some parts (it has some flaws that make me think it was not created by Sun but by another member of the community; this is only my suspicion). Basically, I wanted to implement three changes in the new version.
How the ATTRIBUTE event should be used. The StAX framework, more concretely the XMLStreamReader, parses the XML file and generates events (for example START_DOCUMENT when the document is first opened, START_ELEMENT when an XML tag is found, or CHARACTERS when plain text is reached). Each event also has methods for retrieving its data (for example getText in CHARACTERS returns the string value of the event, and getName in START_ELEMENT returns the name of the tag). The developer iterates over the events returned by the reader and picks out whatever is needed. There is another reader, the XMLEventReader, which is mainly a wrapper around the previous one: it returns event objects that combine both things (the event and the methods for retrieving its data).
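The iteration described above can be sketched as follows. This is only a minimal illustration that assumes the JDK's default StAX implementation and plain XML text instead of WBXML; the class name StreamLoopSketch and the describe helper are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of wbxml-stream.
public class StreamLoopSketch {

    /** Walks the document and records one entry per element or text event. */
    public static List<String> describe(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask for adjacent text to be delivered as a single CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            // The reader starts on START_DOCUMENT; next() moves to the following event.
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        out.add("start:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            out.add("text:" + reader.getText().trim());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.add("end:" + reader.getLocalName());
                        break;
                    default:
                        break; // other events are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe("<si><indication>You have 4 new emails</indication></si>")
                .forEach(System.out::println);
    }
}
```

The data methods are only valid for certain events (getText on CHARACTERS, getLocalName on START_ELEMENT and END_ELEMENT), which is why the switch on the event type comes first.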
One of the events defined in the API is ATTRIBUTE. In my first implementation, that event was returned just once for each tag that had one or more attributes (the methods to access the attributes are always indexed, so I thought that once was enough). But when I moved on to implementing the event reader, I found that the Attribute event class represents just one attribute (name and value). Something was wrong. At that time I decided not to return attribute events at all (the easiest solution). It worked because the ATTRIBUTE event was not generally used by other parts of the JDK.
For version 0.3.0 I wanted to do it properly, so I changed the reader to return one event per attribute in the element. I thought that was the proper implementation, and it also worked well. But then I realized that I was wrong and my first implementation was the good one. The javadoc for the Attribute event class says the following: "Attributes are reported as a set of events accessible from a StartElement. Other applications may report Attributes as first-order events, for example as the results of an XPath expression."
So (as far as I understand) the ATTRIBUTE event is never returned by the readers; it is only an auxiliary event that should be accessed from the START_ELEMENT. So I reverted to my first implementation, this time convinced that it is the proper one.
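To illustrate that reading of the javadoc, here is a small sketch of how attributes are reached from the StartElement event with an XMLEventReader. It assumes the JDK's default StAX implementation and plain XML text; the class name AttributeEventsSketch and its helper are hypothetical.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class, not part of wbxml-stream.
public class AttributeEventsSketch {

    /** Collects "element@attribute" -> value for every attribute in the document. */
    public static Map<String, String> attributes(String xml) {
        Map<String, String> found = new TreeMap<>();
        try {
            XMLEventReader events = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                StartElement start = event.asStartElement();
                String element = start.getName().getLocalPart();
                // Attribute objects hang off the StartElement event.
                Iterator<?> it = start.getAttributes();
                while (it.hasNext()) {
                    Attribute attribute = (Attribute) it.next();
                    found.put(element + "@" + attribute.getName().getLocalPart(),
                            attribute.getValue());
                }
            }
            events.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(attributes(
                "<si><indication href='http://www.xyz.com/email/123/abc.wml' "
                + "created='1999-06-25T15:23:15Z'>You have 4 new emails</indication></si>"));
    }
}
```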
I also wanted to rewrite the WbXmlEventReader. In the previous versions I had implemented my own XMLEvent classes. But I later realized that there are two classes in the StAX API that are meant to be used in the event reader:
The XMLEventAllocator, which is basically an object that converts the current state of a stream reader into the corresponding XMLEvent.
The XMLEventFactory, which is a factory for constructing events (and is used by the allocator).
So I could drop all my internal implementation and simply use an allocator to create the events. Another weird thing about the StAX interface is that you can obtain the default event factory (XMLEventFactory.newFactory method) but not the default allocator (there is no method to get the default allocator of an implementation). So I needed to implement my own allocator, copying the JDK default class.
Now the WbXmlEventReader uses a WbXmlEventAllocator to read the internal stream reader state and convert it into event objects. You can assign another allocator using the standard methods defined in the factory.
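As a rough idea of what an allocator built on top of the XMLEventFactory looks like, here is a minimal sketch. It is not the WbXmlEventAllocator from the library nor the JDK default class: the name SketchEventAllocator is hypothetical, it only covers a handful of event types, and it ignores namespace declarations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.XMLEventAllocator;
import javax.xml.stream.util.XMLEventConsumer;

// Hypothetical, simplified allocator; not the class shipped in wbxml-stream.
public class SketchEventAllocator implements XMLEventAllocator {

    private final XMLEventFactory factory = XMLEventFactory.newInstance();

    @Override
    public XMLEventAllocator newInstance() {
        return new SketchEventAllocator();
    }

    /** Converts the current state of the stream reader into an event object. */
    @Override
    public XMLEvent allocate(XMLStreamReader reader) throws XMLStreamException {
        switch (reader.getEventType()) {
            case XMLStreamConstants.START_DOCUMENT:
                return factory.createStartDocument();
            case XMLStreamConstants.START_ELEMENT: {
                // Attributes travel inside the StartElement event.
                List<Attribute> attributes = new ArrayList<>();
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    attributes.add(factory.createAttribute(
                            reader.getAttributeName(i), reader.getAttributeValue(i)));
                }
                return factory.createStartElement(reader.getName(),
                        attributes.iterator(), Collections.emptyIterator());
            }
            case XMLStreamConstants.CHARACTERS:
                return factory.createCharacters(reader.getText());
            case XMLStreamConstants.END_ELEMENT:
                return factory.createEndElement(reader.getName(),
                        Collections.emptyIterator());
            case XMLStreamConstants.END_DOCUMENT:
                return factory.createEndDocument();
            default:
                throw new XMLStreamException(
                        "Event type not covered by this sketch: " + reader.getEventType());
        }
    }

    @Override
    public void allocate(XMLStreamReader reader, XMLEventConsumer consumer)
            throws XMLStreamException {
        consumer.add(allocate(reader));
    }

    /** Demo helper: allocates the event for the root start tag of the given XML. */
    public static StartElement rootStartElement(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // move from START_DOCUMENT to the root START_ELEMENT
            return new SketchEventAllocator().allocate(reader).asStartElement();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The standard way to plug in an allocator like this is XMLInputFactory.setEventAllocator; whether a given implementation honours it for its event readers depends on that implementation.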
The main feature added in version 0.3.0 is the filtered readers. The StAX API is unnecessarily complicated, and it lets you create filters over the two types of readers (stream and event). The idea is simple: you add a filter that only accepts some types of events and, this way, the methods that iterate over the events skip every event not accepted by the filter (the methods affected should be next, hasNext and peek). The idea seems interesting, but it is quite difficult to implement and (I think) not used very much. Why complicate an API like this? In the previous versions those filtered readers were simply not implemented.
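For the event flavour of the filter, a minimal sketch could look like this. It assumes the JDK's default StAX implementation and plain XML text (with wbxml-stream the factory would be a WbXmlInputFactory instead); the class name EventFilterSketch and the startTags helper are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class, not part of wbxml-stream.
public class EventFilterSketch {

    /** Returns the local names of the start tags seen through a filtered event reader. */
    public static List<String> startTags(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLEventReader plain = factory.createXMLEventReader(new StringReader(xml));
            // The filter accepts only START_ELEMENT events; everything else is skipped.
            EventFilter onlyStartElements = new EventFilter() {
                @Override
                public boolean accept(XMLEvent event) {
                    return event.isStartElement();
                }
            };
            XMLEventReader filtered = factory.createFilteredReader(plain, onlyStartElements);
            while (filtered.hasNext()) {
                XMLEvent event = filtered.nextEvent();
                names.add(event.asStartElement().getName().getLocalPart());
            }
            filtered.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(startTags(
                "<si><indication href='x'>You have 4 new emails</indication></si>"));
    }
}
```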
With the goal of completing the library, I decided a long time ago to implement those readers too. I used a different approach for each of the two types of readers:
The basic stream reader was only modified so that it can save the current state of the parsing (backup) and restore it later (restore). This way you can mark a point in the reader and move back to the same position later.
With that modification in place, a wrapper class was written. The filtered stream reader just manages the next position using the backup/restore system (the next position is the following event that is accepted by the filter). I did not want to complicate the basic implementation any further, and this idea was simple and functional.
The event reader uses the basic stream reader internally. In this case I complicated the event reader implementation a bit (again using the backup/restore methods) to get the next position accepted by the filter. So, more or less, the event reader uses the same idea as the filtered stream reader (if it worked once...).
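The backup/restore idea can be sketched in a few lines. This is not the code from the library: Cursor is a hypothetical stand-in for the stream reader (just a position over a list of event codes), and the names backup, restore, FilteredCursor and acceptedEvents are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntPredicate;
import javax.xml.stream.XMLStreamConstants;

// Hypothetical sketch of a filter built on backup/restore; not the library code.
public class BackupRestoreFilterSketch {

    /** Stand-in for a stream reader: a cursor over a list of event codes. */
    public static final class Cursor {
        private final List<Integer> events;
        private int position = 0; // index of the current event

        public Cursor(List<Integer> events) {
            this.events = events;
        }

        public boolean hasNext() {
            return position + 1 < events.size();
        }

        public int next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return events.get(++position);
        }

        public int backup() {
            return position; // the saved parsing state
        }

        public void restore(int saved) {
            position = saved; // move back to the saved state
        }
    }

    /** Wrapper that only stops on events accepted by the filter. */
    public static final class FilteredCursor {
        private final Cursor cursor;
        private final IntPredicate filter;

        public FilteredCursor(Cursor cursor, IntPredicate filter) {
            this.cursor = cursor;
            this.filter = filter;
        }

        /** Looks ahead for an accepted event, then goes back to where it was. */
        public boolean hasNext() {
            int saved = cursor.backup();
            try {
                while (cursor.hasNext()) {
                    if (filter.test(cursor.next())) {
                        return true;
                    }
                }
                return false;
            } finally {
                cursor.restore(saved);
            }
        }

        /** Advances to the next accepted event, skipping everything else. */
        public int next() {
            while (cursor.hasNext()) {
                int event = cursor.next();
                if (filter.test(event)) {
                    return event;
                }
            }
            throw new NoSuchElementException();
        }
    }

    /** Demo helper: every event the filtered cursor stops on. */
    public static List<Integer> acceptedEvents(List<Integer> events, IntPredicate filter) {
        FilteredCursor filtered = new FilteredCursor(new Cursor(events), filter);
        List<Integer> out = new ArrayList<>();
        while (filtered.hasNext()) {
            out.add(filtered.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(
                XMLStreamConstants.START_DOCUMENT, XMLStreamConstants.START_ELEMENT,
                XMLStreamConstants.CHARACTERS, XMLStreamConstants.END_ELEMENT,
                XMLStreamConstants.END_DOCUMENT);
        System.out.println(acceptedEvents(events,
                e -> e == XMLStreamConstants.START_ELEMENT));
    }
}
```

The point of the design is that the lookahead in hasNext leaves the underlying reader untouched, so the wrapper needs no buffering of its own.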
Now the wbxml-stream library lets you use a filtered reader (stream or event) to read a WBXML document. Here I present one of the unit tests inside the library, which checks the implementation against an SI document. Imagine you have to parse the following SI XML document, but in WBXML format:
<?xml version="1.0"?>
<!DOCTYPE si PUBLIC "-//WAPFORUM//DTD SI 1.0//EN" "http://www.wapforum.org/DTD/si.dtd">
<si>
  <indication href="http://www.xyz.com/email/123/abc.wml"
              created="1999-06-25T15:23:15Z"
              si-expires="1999-06-30T00:00:00Z">
    You have 4 new emails
  </indication>
</si>
Think of a StreamFilter that only accepts START_ELEMENT events. Using a filter like that, the previous document could be read this way:
XMLInputFactory f = new WbXmlInputFactory();
XMLStreamReader reader = f.createXMLStreamReader(new FileInputStream("si-001.wbxml"));
StreamFilter filter = new StreamFilter() {
    @Override
    public boolean accept(XMLStreamReader reader) {
        return reader.getEventType() == XMLStreamConstants.START_ELEMENT;
    }
};
reader = f.createFilteredReader(reader, filter);
// it should be the start document
Assert.assertEquals(XMLStreamConstants.START_DOCUMENT, reader.getEventType());
// read next => it should be "si"
Assert.assertEquals(XMLStreamConstants.START_ELEMENT, reader.next());
Assert.assertEquals("si", reader.getName().getLocalPart());
Assert.assertEquals(0, reader.getAttributeCount());
// next it should be "indication"
Assert.assertEquals(XMLStreamConstants.START_ELEMENT, reader.next());
Assert.assertEquals("indication", reader.getName().getLocalPart());
Assert.assertEquals(3, reader.getAttributeCount());
Assert.assertEquals("1999-06-25T15:23:15Z", reader.getAttributeValue(null, "created"));
Assert.assertEquals("http://www.xyz.com/email/123/abc.wml", reader.getAttributeValue(null, "href"));
Assert.assertEquals("1999-06-30T00:00:00Z", reader.getAttributeValue(null, "si-expires"));
Assert.assertEquals("You have 4 new emails", reader.getElementText());
Assert.assertEquals(XMLStreamConstants.END_ELEMENT, reader.getEventType());
// no more start elements
Assert.assertFalse(reader.hasNext());
And that is all. The new version 0.3.0 of wbxml-stream adds the filtered readers to the implementation as its main new feature. It also improves the WbXmlEventReader in several respects and includes some fixes. If you are currently using this library, please try the new version by downloading it from here. As always, any problem or enhancement can be reported on GitHub.
My personal feeling about the StAX API is changing day by day. Now I think that this API is ridiculously complicated. It has a lot of ways of doing exactly the same thing with no great benefit (stream, event, filters,...), sometimes it is not very clear (the ATTRIBUTE event is an example) and it has some flaws (why can I not obtain the default XMLEventAllocator?). I do not know, maybe I am not understanding something, but with the XMLStreamReader you have more than enough to read any XML document (it is also the fastest way); all the rest is just cumbersome stuff.
Regards!