Saturday, July 30. 2016
wbxml-stream: Version 0.4.0
Today I present another entry about the wbxml-stream project. Although the project is not very important or widely used, I wanted to stabilize it and release a version that is more or less complete. For that reason version 0.3.0 implemented the missing features of the StAX API and today's version offers more languages. My idea was to support the same languages that the C counterpart libwbxml manages. So the new version 0.4.0 provides many more languages (WML, WV CSP 1.2, SYNCML 1.0,...) and, as a side effect, covers an important new feature.
The first idea for this version was just to add more property files to support all the languages. Until now, wbxml-stream only supported the languages that were tested in the C counterpart (I mean, the ones with XML sample files to test the correct implementation). In theory no more features were going to be added but, as always, the WBXML protocol is a box of never-ending surprises. One of the new languages is WML (Wireless Markup Language), which was used long ago to present pages on old phones. It uses common WBXML features but adds one particular detail (only for WML). WML uses variables (it is a little programming language) and the variables can be of three types: escape, un-escape and no-escape (please don't ask me what each type means). A variable should be encoded into WBXML using the EXT_I or EXT_T extensions. In general, the WBXML protocol defines three extensions: EXT_I, EXT_T and EXT. The definition in the specification is as follows:
extension = [switchPage] (( EXT_I termstr ) | ( EXT_T index ) | EXT)
switchPage = SWITCH_PAGE pageindex
pageindex = u_int8
termstr = charset-dependent string with termination (0x00)
index = mb_u_int32 // integer index into string table.
And there is a section dedicated to them (5.8.4.2. Global Extension Tokens). It says that global extensions are available for document-specific use (the language can use them for whatever it wants). Each type has in turn three versions (EXT_I_[012], EXT_T_[012] and EXT_[012]), each defined with a specific token number. The EXT_I extensions carry a string, the EXT_T ones a long number (intended to be used with the string table) and the EXT ones are just the token itself (no data associated). In summary, with these characteristics, a language can use the global extensions for any purpose. In my humble opinion it is better not to use any extension (same thing as for opaques), because they add custom features that need to be implemented in every library (I do not understand why WBXML is so fucking extensible).
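To make the three extension types more concrete, here is a minimal standalone sketch that reads one global extension from a byte stream. It is not the wbxml-stream code: the class and method names are hypothetical, the token numbers (0x40-0x42, 0x80-0x82, 0xC0-0xC2) are the global tokens of the WBXML specification, and the string is assumed to be UTF-8 for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class GlobalExtensionSketch {

    // Global token values from the WBXML specification.
    static final int EXT_I_0 = 0x40, EXT_I_1 = 0x41, EXT_I_2 = 0x42;
    static final int EXT_T_0 = 0x80, EXT_T_1 = 0x81, EXT_T_2 = 0x82;
    static final int EXT_0 = 0xC0, EXT_1 = 0xC1, EXT_2 = 0xC2;

    /** Reads a mb_u_int32: 7 bits per byte, high bit set on every byte except the last. */
    public static long readMbUInt32(InputStream in) throws IOException {
        long value = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) {
                throw new IOException("Unexpected end of stream");
            }
            value = (value << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return value;
    }

    /** Reads a 0x00-terminated string (assumption: the document charset is UTF-8). */
    public static String readTermStr(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) > 0) {
            buf.write(b);
        }
        if (b < 0) {
            throw new IOException("Unterminated string");
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Describes the extension that starts at the current position of the stream. */
    public static String describe(InputStream in) throws IOException {
        int token = in.read();
        switch (token) {
            case EXT_I_0: case EXT_I_1: case EXT_I_2:
                // EXT_I: the data is an inline string
                return "EXT_I_" + (token - EXT_I_0) + " \"" + readTermStr(in) + "\"";
            case EXT_T_0: case EXT_T_1: case EXT_T_2:
                // EXT_T: the data is a multi-byte integer
                return "EXT_T_" + (token - EXT_T_0) + " " + readMbUInt32(in);
            case EXT_0: case EXT_1: case EXT_2:
                // EXT: just the token, no data
                return "EXT_" + (token - EXT_0);
            default:
                throw new IOException("Not a global extension token: " + token);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] inline = {0x40, 'v', 'a', 'r', 0x00};
        byte[] table = {(byte) 0x81, (byte) 0x81, 0x00}; // EXT_T_1 with value 128
        byte[] single = {(byte) 0xC2};
        System.out.println(describe(new ByteArrayInputStream(inline)));
        System.out.println(describe(new ByteArrayInputStream(table)));
        System.out.println(describe(new ByteArrayInputStream(single)));
    }
}
```

What each extension means is left to the language, which is exactly why a plugin is needed on top of this generic reading step.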
WML uses EXT_I and EXT_T extensions to encode the variable names. Whenever you find a variable (in WML they are written in the style of $var or $(var), and you can specify the type using $(var:escape) for example) it should be transformed into an EXT_I or EXT_T. The variable name is put in the inline string (EXT_I) or in the string table (the EXT_T index marks the position in the table). Besides, the three different versions of each extension (0, 1 and 2) are used to specify the type of the variable (escape, un-escape or no-escape). The idea behind that (I suppose) is to reduce the document length (using EXT_T is better for that purpose).
Previously the wbxml-stream library only managed EXT_T_0 extensions, because they were used by WV (Wireless Village). This language defines some tokens to change arbitrary strings into EXT_T_0 extensions (for example "https://" is defined as the extension token 0x0f) in order to get shorter values and attributes. The C library libwbxml integrates this definition as a common WBXML feature (the extension tokens are defined in a general structure), so, because I copied a lot of ideas from the C implementation, I did exactly the same. Now it is clear that this is a specific feature of the WV-CSP language and not a general WBXML one. But I am going to leave things like this: any language can still define extension strings that will be replaced by EXT_T_0 tokens, and the encoding is done in the WbXmlEncoder core class.
What is different now, in the new version, is that there are three new interfaces to define what the language does when an extension token is found. There are three because one interface is defined for each type of extension (EXT_I receives the string, EXT_T the long and EXT receives nothing).
public interface ExtensionIPlugin {
public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext, String value) throws IOException;
public String parseAttribute(WbXmlParser parser, String attrName, byte ext, String value) throws IOException;
}
public interface ExtensionTPlugin {
public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext, long value) throws IOException;
public String parseAttribute(WbXmlParser parser, String attrName, byte ext, long value) throws IOException;
}
public interface ExtensionPlugin {
public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext) throws IOException;
public String parseAttribute(WbXmlParser parser, String attrName, byte ext) throws IOException;
}
I think the idea is more or less clear. For example, in WML, if you find an EXT_I_0 token with the string "var", it should be replaced by a variable called var of type escape, that is "$(var:escape)". So the WML language adds a class that implements both ExtensionIPlugin and ExtensionTPlugin. Besides, there are new property names to specify the classes that will handle those extensions. For example, in the WML language the EXT_I and EXT_T options are used to mark the class that handles them:
wbxml.extension.EXT_I=es.rickyepoderi.wbxml.document.opaque.WMLVariableExtension
wbxml.extension.EXT_T=es.rickyepoderi.wbxml.document.opaque.WMLVariableExtension
When the wbxml-stream implementation finds an extension token and no class is defined to handle it, it just throws an exception.
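As a rough illustration of what a WML handler has to do when parsing, this standalone sketch turns an EXT_I or EXT_T token plus a variable name into the WML variable text. It does not implement the real ExtensionIPlugin or ExtensionTPlugin interfaces (they need library classes), all the names are hypothetical, and the "unesc" and "noesc" suffixes are my assumption taken from the WML variable syntax (only "escape" appears in this post).

```java
import java.nio.charset.StandardCharsets;

public class WmlVariableSketch {

    // Conversion names for versions 0, 1 and 2. "escape" comes from the post,
    // "unesc" and "noesc" are assumed from the WML variable syntax.
    private static final String[] TYPES = {"escape", "unesc", "noesc"};

    /** Builds the WML variable for an EXT_I_x or EXT_T_x token and a variable name. */
    public static String toVariable(int token, String name) {
        int family = token & 0xC0;  // 0x40 = EXT_I, 0x80 = EXT_T
        int version = token & 0x3F; // 0, 1 or 2
        if ((family != 0x40 && family != 0x80) || version > 2) {
            throw new IllegalArgumentException("Not an EXT_I/EXT_T token: " + token);
        }
        return "$(" + name + ":" + TYPES[version] + ")";
    }

    /** Reads the 0x00-terminated string that starts at the given offset of a string table. */
    public static String fromStringTable(byte[] table, int offset) {
        int end = offset;
        while (end < table.length && table[end] != 0) {
            end++;
        }
        return new String(table, offset, end - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EXT_I_0 with the inline string "var"
        System.out.println(toVariable(0x40, "var"));
        // EXT_T_2 whose index points to "var" inside a tiny string table
        byte[] table = {'a', 0, 'v', 'a', 'r', 0};
        System.out.println(toVariable(0x82, fromStringTable(table, 2)));
    }
}
```

In the library the same logic would live behind parseContent and parseAttribute, with the EXT_T variant resolving the name through the parser's string table.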
Finally, with all the stuff shown before we know how to parse/decode a WBXML document that uses extensions but... How do we encode them? I mean, how is a variable in WML written using an extension instead of a normal string? In WV-CSP it is done in the core but... How to do it normally? I just recommend using an opaque plugin. I know it is quite strange but it is easier and I do not want to complicate the API even more for the moment. An opaque plugin was initially designed to replace any attribute value or tag element with an opaque (both ways, parsing and encoding), but you can re-use the interface to encode extensions too (the encode method is used and the parse method will never be called because no opaque will be found). This way the WML plugin searches for variable strings and replaces the findings with EXT_I or EXT_T (depending on whether the string table is used). I know it is quite weird but I think it is the easiest way to manage it; maybe in future versions an API change will be needed to reorganize extensions and opaques.
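The encoding direction can be sketched like this: search the text for variables, write the literal parts as inline strings (STR_I, token 0x03) and each variable as an EXT_I. This is only a standalone illustration, not the opaque plugin of the library. The class name is hypothetical, only the explicit $(name:type) form is handled, the "unesc" and "noesc" names are assumed from the WML syntax and the strings are written as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WmlVariableEncoderSketch {

    private static final int STR_I = 0x03;   // inline string token
    private static final int EXT_I_0 = 0x40; // EXT_I_1 and EXT_I_2 follow

    // Conversion names in token order (assumption for "unesc" and "noesc").
    private static final List<String> TYPES = Arrays.asList("escape", "unesc", "noesc");

    // Only the explicit form $(name:type) is recognized in this sketch.
    private static final Pattern VAR = Pattern.compile("\\$\\((\\w+):(escape|unesc|noesc)\\)");

    /** Encodes a text as a sequence of STR_I and EXT_I_x items. */
    public static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Matcher m = VAR.matcher(text);
        int last = 0;
        while (m.find()) {
            // literal text before the variable
            writeString(out, STR_I, text.substring(last, m.start()));
            // the variable itself: the extension version encodes the type
            writeString(out, EXT_I_0 + TYPES.indexOf(m.group(2)), m.group(1));
            last = m.end();
        }
        writeString(out, STR_I, text.substring(last));
        return out.toByteArray();
    }

    /** Writes token + string + 0x00; empty strings are skipped. */
    private static void writeString(ByteArrayOutputStream out, int token, String s) {
        if (s.isEmpty()) {
            return;
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(token);
        out.write(bytes, 0, bytes.length);
        out.write(0);
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("Hi $(n:escape)!")) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim());
    }
}
```

Using EXT_T instead would mean adding the name to the string table and writing the index as a mb_u_int32, which is shorter when the same variable appears several times.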
So, just to summarize how to handle extensions in the wbxml-stream library, I am going to list what WML and WV-CSP do (the only two languages that manage extensions):
WML provides one class that implements ExtensionIPlugin and ExtensionTPlugin and it is associated in the language definition. This way, when parsing a WBXML document, that class is called to handle any extension that is found. The class transforms the extension into a proper WML variable. But the encoding is done differently: two opaque plugins (attribute and content) are used instead, and those classes only perform the encoding part, searching for variables in the string and replacing them with extensions. The parse method is not needed because it will never be called (no opaque will be found, because none is written).
WV-CSP provides an ExtensionTPlugin (it searches for the corresponding string using the number in the EXT_T) but no opaque plugin. The library provides that feature (extension replacement using EXT_T_0) by default. As I explained before, I copied the libwbxml implementation when I had not completely understood WBXML extensions. I decided not to delete that feature (any language can use it) and, therefore, no opaque plugin is needed because the translation is done in the WbXmlEncoder class (but think about it as an exception, you will normally need an opaque plugin if you want to write extensions). For the same reason the ExtensionTPlugin that the WV language uses is called DefaultExtTExtension instead of using a specific WV name.
And that's all. wbxml-stream 0.4.0 is supposed to be feature complete, exactly at the same level as its counterpart libwbxml. The new languages are not so well tested because I did not find XML sample documents for them. I tried to have, at least, one XML and WBXML document for every language, but in some cases that is clearly insufficient. For that reason the library keeps using 0.X.X versions. As usual, if you (for some strange circumstance) need to use a WBXML parser/encoder library in Java, please consider using the wbxml-stream project and report any issue on github.
Regards!
Saturday, July 9. 2016
wbxml-stream: Version 0.3.0
After version 0.2.0 of the wbxml-stream library, I decided to continue working on the project to complete the StAX (Streaming API for XML) implementation for the WBXML format. StAX is quite complicated and not very clear in some parts (it has some flaws that make me think it was not created by Sun but by another member of the community; this is only my suspicion). Basically, for the new version I wanted to implement three changes.
How the ATTRIBUTE event should be used. The StAX framework, the XMLStreamReader more concretely, parses the XML file generating events (for example START_DOCUMENT when the document is just opened, START_ELEMENT when an XML tag is found or CHARACTERS when simple text is reached). Besides, there are some methods associated with each event to recover its data (for example getText in CHARACTERS to retrieve the string value of the event, or getName in START_ELEMENT to know the name of the tag). The developer should iterate over the events returned by the reader and pick up the things he/she wants. There is another reader, the XMLEventReader, which is mainly a wrapper of the previous one and returns event objects that combine both (the event and the methods for retrieving the data associated with it).
One of those events defined in the API is ATTRIBUTE. In my first implementation, that event was returned just once for each tag that had one or more attributes (the methods to access the attributes are always indexed, so I thought that once was enough). But when I switched to implementing the event reader, I saw that the Attribute event class just represents one attribute (name and value). Something was wrong. At that time I decided not to return attribute events at all (the easiest solution at that time). It worked because the ATTRIBUTE event was not used in general by other parts of the JDK.
For version 0.3.0 I wanted to do it properly and I changed the code to return as many events as there were attributes in the element. I thought that was the proper implementation and it also worked well. But then I realized that I was wrong and my previous solution of not returning the event at all was the good one. The javadoc for the Attribute event class says the following: "Attributes are reported as a set of events accessible from a StartElement. Other applications may report Attributes as first-order events, for example as the results of an XPath expression."
So (from what I understand) the ATTRIBUTE event is never returned by the readers, it is only an auxiliary event that should be accessed from the START_ELEMENT. So I reverted back to not returning it. But now I think it is the proper implementation.
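The following small example, which only uses the standard JDK StAX classes (nothing from wbxml-stream), shows that reading of the javadoc: the attributes are obtained from the StartElement event through getAttributes. The class name and the sample XML are just illustrative.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class StartElementAttributesSketch {

    /** Returns the attributes (local name to value) of the first element of the document. */
    public static Map<String, String> attributesOfFirstElement(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        Map<String, String> result = new TreeMap<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    // the Attribute events hang from the StartElement
                    Iterator<?> it = start.getAttributes();
                    while (it.hasNext()) {
                        Attribute attr = (Attribute) it.next();
                        result.put(attr.getName().getLocalPart(), attr.getValue());
                    }
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<indication href=\"http://www.xyz.com/email/123/abc.wml\" "
                + "created=\"1999-06-25T15:23:15Z\">You have 4 new emails</indication>";
        System.out.println(attributesOfFirstElement(xml));
    }
}
```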
I also wanted to rewrite the WbXmlEventReader. In the previous versions I had implemented my own XMLEvent classes. But I later realized that there are two classes in the StAX API that are meant to be used in the event reader:
The XMLEventAllocator, which is basically an object that converts the current state of a stream reader into the corresponding XMLEvent.
The XMLEventFactory, which is a factory to construct events (and is used by the allocator).
So I could drop all my internal implementation and simply use an allocator to create the events. Another weird thing about the StAX interface is that you can obtain the default event factory (XMLEventFactory.newFactory method) but not the default allocator (there is no method to get the default allocator of an implementation). So I needed to implement my own allocator, copying the JDK default class.
Now the WbXmlEventReader uses a WbXmlEventAllocator to read the internal stream reader state and convert it into event objects. You can assign another allocator using the standard methods defined in the factory.
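To give an idea of what such an allocator looks like, here is a reduced sketch built only on the standard StAX interfaces and the default XMLEventFactory. It is not the WbXmlEventAllocator of the library: the class name is hypothetical, namespaces are ignored and only the five most common event types are handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Namespace;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.XMLEventAllocator;
import javax.xml.stream.util.XMLEventConsumer;

public class SimpleAllocatorSketch implements XMLEventAllocator {

    // The default factory is obtainable, the default allocator is not.
    private final XMLEventFactory factory = XMLEventFactory.newFactory();

    @Override
    public XMLEventAllocator newInstance() {
        return new SimpleAllocatorSketch();
    }

    /** Converts the current state of the stream reader into an event object. */
    @Override
    public XMLEvent allocate(XMLStreamReader reader) throws XMLStreamException {
        switch (reader.getEventType()) {
            case XMLStreamConstants.START_DOCUMENT:
                return factory.createStartDocument();
            case XMLStreamConstants.START_ELEMENT: {
                List<Attribute> attrs = new ArrayList<>();
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    attrs.add(factory.createAttribute(
                            reader.getAttributeName(i), reader.getAttributeValue(i)));
                }
                // namespaces are ignored in this sketch
                return factory.createStartElement(reader.getName(), attrs.iterator(),
                        Collections.<Namespace>emptyIterator());
            }
            case XMLStreamConstants.CHARACTERS:
                return factory.createCharacters(reader.getText());
            case XMLStreamConstants.END_ELEMENT:
                return factory.createEndElement(reader.getName(),
                        Collections.<Namespace>emptyIterator());
            case XMLStreamConstants.END_DOCUMENT:
                return factory.createEndDocument();
            default:
                throw new XMLStreamException(
                        "Event type not handled by this sketch: " + reader.getEventType());
        }
    }

    @Override
    public void allocate(XMLStreamReader reader, XMLEventConsumer consumer) throws XMLStreamException {
        consumer.add(allocate(reader));
    }

    /** Reads a whole document, allocating one event per reader state. */
    public static List<XMLEvent> readAll(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        XMLEventAllocator allocator = new SimpleAllocatorSketch();
        List<XMLEvent> events = new ArrayList<>();
        events.add(allocator.allocate(reader)); // the reader starts at START_DOCUMENT
        while (reader.hasNext()) {
            reader.next();
            events.add(allocator.allocate(reader));
        }
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (XMLEvent event : readAll("<si><indication href=\"x\">hello</indication></si>")) {
            System.out.println(event.getEventType());
        }
    }
}
```

An allocator like this one can also be assigned to a factory with the standard XMLInputFactory.setEventAllocator method.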
The main feature added in version 0.3.0 is the filtered readers. The StAX API is unnecessarily complicated and it allows creating filters over the two types of readers (stream and event). The idea is simple: you can add a filter that only accepts some types of events and, this way, the methods that iterate over the events just skip all the events not accepted by the filter (the methods affected should be next, hasNext and peek). The idea seems interesting but it is quite difficult to implement and (I think) not used very much. Why complicate an API like this? In the previous versions those filtered readers were simply not implemented.
With the goal of completing the library, I decided a long time ago to implement those readers too. I used different approaches for the two types of readers:
The basic stream reader was only modified to allow saving the current state of the parsing (backup) and restoring it later (restore). This way you can save a point in the reader and move back to the same position later.
With that modification a wrapper class was made. The filtered stream reader just manages the next position using the backup/restore system (the next position is calculated as the following event that is accepted by the filter). I did not want to complicate the basic implementation any further and this idea was simple and functional.
The event reader internally uses the basic stream reader. In this case I complicated the event reader implementation a bit (using the backup/restore methods again) to get the next position accepted by the filter. So, more or less, the event reader uses the same idea as the filtered stream reader (if it worked once...).
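The backup/restore idea can be reduced to the following toy sketch. It is not the library code: the events are plain strings, the saved state is just a list position (the real parser obviously has much more state to save) and all the names are hypothetical. It only shows how hasNext can look ahead and then come back.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

public class BackupRestoreFilterSketch {

    /** A forward-only cursor over events whose position can be saved and restored. */
    public static class Cursor {
        private final List<String> events;
        private int position = 0; // index of the current event

        public Cursor(List<String> events) {
            this.events = events;
        }

        public boolean hasNext() {
            return position + 1 < events.size();
        }

        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return events.get(++position);
        }

        public int backup() {
            return position;
        }

        public void restore(int saved) {
            position = saved;
        }
    }

    /** Wrapper that only returns the events accepted by the filter. */
    public static class FilteredCursor {
        private final Cursor cursor;
        private final Predicate<String> filter;

        public FilteredCursor(Cursor cursor, Predicate<String> filter) {
            this.cursor = cursor;
            this.filter = filter;
        }

        /** Looks ahead for an accepted event and then goes back to the saved position. */
        public boolean hasNext() {
            int saved = cursor.backup();
            try {
                while (cursor.hasNext()) {
                    if (filter.test(cursor.next())) {
                        return true;
                    }
                }
                return false;
            } finally {
                cursor.restore(saved);
            }
        }

        /** Moves to the next accepted event, skipping the rest. */
        public String next() {
            while (cursor.hasNext()) {
                String event = cursor.next();
                if (filter.test(event)) {
                    return event;
                }
            }
            throw new NoSuchElementException();
        }
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("START_DOCUMENT", "START_ELEMENT si",
                "START_ELEMENT indication", "CHARACTERS", "END_ELEMENT", "END_ELEMENT",
                "END_DOCUMENT");
        FilteredCursor filtered = new FilteredCursor(new Cursor(events),
                e -> e.startsWith("START_ELEMENT"));
        while (filtered.hasNext()) {
            System.out.println(filtered.next());
        }
    }
}
```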
Now the wbxml-stream library lets you use a filtered reader (stream or event) to read a WBXML document. Here I present one of the unit tests inside the library that checks the correct implementation with an SI document. Imagine you have to parse the following SI XML document, but in WBXML format:
<?xml version="1.0"?>
<!DOCTYPE si PUBLIC "-//WAPFORUM//DTD SI 1.0//EN" "http://www.wapforum.org/DTD/si.dtd">
<si>
  <indication href="http://www.xyz.com/email/123/abc.wml"
              created="1999-06-25T15:23:15Z"
              si-expires="1999-06-30T00:00:00Z">
    You have 4 new emails
  </indication>
</si>
You can think about a StreamFilter that only accepts the START_ELEMENT events. Using a filter like that, the previous document could be read this way:
XMLInputFactory f = new WbXmlInputFactory();
XMLStreamReader reader = f.createXMLStreamReader(new FileInputStream("si-001.wbxml"));
StreamFilter filter = new StreamFilter() {
    @Override
    public boolean accept(XMLStreamReader reader) {
        return reader.getEventType() == XMLStreamConstants.START_ELEMENT;
    }
};
reader = f.createFilteredReader(reader, filter);
// it should be the start document
Assert.assertEquals(XMLStreamConstants.START_DOCUMENT, reader.getEventType());
// read next => it should be "si"
Assert.assertEquals(XMLStreamConstants.START_ELEMENT, reader.next());
Assert.assertEquals("si", reader.getName().getLocalPart());
Assert.assertEquals(0, reader.getAttributeCount());
// next it should be "indication"
Assert.assertEquals(XMLStreamConstants.START_ELEMENT, reader.next());
Assert.assertEquals("indication", reader.getName().getLocalPart());
Assert.assertEquals(3, reader.getAttributeCount());
Assert.assertEquals("1999-06-25T15:23:15Z", reader.getAttributeValue(null, "created"));
Assert.assertEquals("http://www.xyz.com/email/123/abc.wml", reader.getAttributeValue(null, "href"));
Assert.assertEquals("1999-06-30T00:00:00Z", reader.getAttributeValue(null, "si-expires"));
Assert.assertEquals("You have 4 new emails", reader.getElementText());
Assert.assertEquals(XMLStreamConstants.END_ELEMENT, reader.getEventType());
// no more start elements
Assert.assertFalse(reader.hasNext());
And that is all. The new version 0.3.0 of wbxml-stream adds the filtered readers to the implementation as its main new feature. It also improves the WbXmlEventReader in several aspects and includes some fixes. If you are currently using this library please try the new version downloading it from here. As always, any problem or enhancement can be reported on github. My personal feeling about the StAX API is changing day by day. Now I think that this API is ridiculously complicated. It has a lot of ways of doing exactly the same thing with no great benefit (stream, event, filters,...), sometimes it is not very clear (the ATTRIBUTE event is an example) and it has some flaws (why can I not obtain the default XMLEventAllocator?). I do not know, maybe I am not understanding something, but with the XMLStreamReader class you have more than enough to read any XML document (it is also the fastest way), all the rest is just cumbersome stuff.
Regards!
Tuesday, June 7. 2016
WBXML stream: Version 0.2.0
Some days ago pwnslinger opened the first issue in the wbxml-stream project. If you remember, that project is my little effort to provide a WBXML parser/encoder for Java/StAX. Mainly the problem was that the library did not correctly parse WBXML documents in version 1.1. The WBXML standard has four versions, from 1.0 to 1.3, and there are subtle differences between them. Version 0.1.0 of the wbxml-stream library just managed version 1.3 (only 1.3 documents could be encoded and the previous versions were just parsed as if they were the last one, no differences were taken into account). Nevertheless version 1.1 had an issue in the defined enumeration (a copy/paste problem :-/ ) and it was not recognized by the library.
That issue made me improve the implementation in order to properly manage the four different versions of the specification. This way version 0.2.0 has been released with better handling of the previous versions of the standard. Now it is possible to encode a WBXML document in any version. Besides, the parsing/encoding of documents takes into account the specific differences between versions. The first version 1.0 does not include the charset of the document (it just manages the unknown encoding) and it does not recognize opaques. Version 1.1 adds tag opaques and charset/encodings but attribute opaques are added later in version 1.2. Version 1.2 also adds page switches to increase the number of tags, attributes and values a definition can handle. There is one difference between version 1.0 and 1.1 which wbxml-stream does not manage. In version 1.1 the document body is defined as *pi element *pi (an element with optional processing instructions before and after it); in contrast, in version 1.0 the body was 1*content (one or more content items, each of which can in turn be an element, a string, an extension, an entity or a processing instruction). This older definition is quite weird: a WBXML document could be only a string or an entity, which is clearly not a valid XML document. For that reason I decided to forget about this difference (a WBXML version 1.0 document with that strange content will not be correctly parsed, throwing an exception for sure). A page in the project wiki summarizes those differences if you are interested in the details.
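Those differences can be summarized in a small table of feature flags. The following enum is only an illustration of the list above, not the WbXmlVersion class of the library. The names are hypothetical, and the version byte follows the WBXML header rule (major version minus one in the upper four bits, minor version in the lower four).

```java
public class WbXmlVersionSketch {

    /** Feature flags per WBXML version, following the differences described in this post. */
    public enum Version {
        //   byte  charset tagOpaque attrOpaque switchPage
        V1_0(0x00, false,  false,    false,     false),
        V1_1(0x01, true,   true,     false,     false),
        V1_2(0x02, true,   true,     true,      true),
        V1_3(0x03, true,   true,     true,      true);

        public final int headerByte;
        public final boolean charset;
        public final boolean tagOpaques;
        public final boolean attributeOpaques;
        public final boolean switchPage;

        Version(int headerByte, boolean charset, boolean tagOpaques,
                boolean attributeOpaques, boolean switchPage) {
            this.headerByte = headerByte;
            this.charset = charset;
            this.tagOpaques = tagOpaques;
            this.attributeOpaques = attributeOpaques;
            this.switchPage = switchPage;
        }

        /** Finds the version for the first byte of a WBXML document. */
        public static Version fromHeaderByte(int b) {
            for (Version v : values()) {
                if (v.headerByte == b) {
                    return v;
                }
            }
            throw new IllegalArgumentException("Unsupported WBXML version byte: " + b);
        }
    }

    public static void main(String[] args) {
        Version v = Version.fromHeaderByte(0x01);
        System.out.println(v + " attribute opaques allowed: " + v.attributeOpaques);
    }
}
```

An encoder can check these flags before writing an opaque or a page switch and warn or fail when the chosen version does not support it.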
The new version of the library also adds some improvements in the management of encoding (now the default charset, the one used for the unknown encoding, is UTF-8 instead of ASCII) and numeric character references (things like &#241; or &#xF1; to reference the character ñ). In the latter I am not sure if everything is right and maybe a new version will be needed but, for sure, it is in a better condition than in the previous version. The WbXmlOutputFactory has been updated in order to receive the version we want to write the WBXML with (property es.rickyepoderi.wbxml.stream.version). Besides, the command Xml2WbXml now accepts two more options (-c or --charset and -v or --version) to convert the XML file to WBXML using the encoding and the version specified.
Here is a little snippet that uses the new version 0.2.0 to convert an SL XML file into WBXML v1.1 using a specific encoding.
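For reference, resolving a numeric character reference is just converting the decimal or hexadecimal number into its code point. This is a minimal standalone sketch of that conversion (it is not how the library does it, the class name is hypothetical and invalid code points are not handled):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericCharRefSketch {

    // Matches decimal (&#241;) and hexadecimal (&#xF1;) references.
    private static final Pattern REF = Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** Replaces every numeric character reference with the character it points to. */
    public static String resolve(String text) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String replacement = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // both references point to U+00F1
        System.out.println(resolve("Espa&#241;a"));
        System.out.println(resolve("Espa&#xF1;a"));
    }
}
```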
// read the XML using DOM
InputStream in = new FileInputStream("sl-001.xml");
DocumentBuilderFactory domFact = DocumentBuilderFactory.newInstance();
domFact.setNamespaceAware(true);
domFact.setIgnoringElementContentWhitespace(true);
DocumentBuilder domBuilder = domFact.newDocumentBuilder();
Document doc = domBuilder.parse(in);
Element element = doc.getDocumentElement();
// locate the definition of the WBXML using the name
WbXmlDefinition definition = WbXmlInitialization.getDefinitionByName("SL 1.0");
// create the StAX stream writer using the definition
OutputStream out = new FileOutputStream("sl-001.wbxml");
XMLOutputFactory fact = new WbXmlOutputFactory();
fact.setProperty(WbXmlOutputFactory.DEFINITION_PROPERTY, definition);
fact.setProperty(WbXmlOutputFactory.VERSION_PROPERTY, WbXmlVersion.VERSION_1_1);
XMLStreamWriter xmlStreamWriter = fact.createXMLStreamWriter(out, "ISO-8859-1");
// create a transformer to convert DOM into StAX
Transformer xformer = TransformerFactory.newInstance().newTransformer();
Source domSource = new DOMSource(doc);
StAXResult staxResult = new StAXResult(xmlStreamWriter);
xformer.transform(domSource, staxResult);
And here I present an execution of the command trying to convert an SI XML file into WBXML version 1.0 with encoding ISO-8859-1 (remember that v1.0 does not use encoding, therefore any character outside ASCII is compromised). As the SI document uses some opaques (which are not defined in version 1.0), the implementation avoids the use of the opaque and some warning messages are displayed (in general, issues with versions generate warnings or throw exceptions; in this case the encoding does not use the opaque and warns the user because the resulting document is probably invalid).
$ java -cp wbxml-stream-0.2.0.jar es.rickyepoderi.wbxml.tools.Xml2WbXml -d "SI 1.0" -v 1.0 -c ISO-8859-1 si.xml si.wbxml
Jun 06, 2016 7:23:59 PM es.rickyepoderi.wbxml.document.WbXmlEncoder encode
WARNING: Opaque not used for attribute "created" in element "indication" because version "1.0" does not accept attribute opaques.
Jun 06, 2016 7:23:59 PM es.rickyepoderi.wbxml.document.WbXmlEncoder encode
WARNING: Opaque not used for attribute "si-expires" in element "indication" because version "1.0" does not accept attribute opaques.
And that is all. If you are using the wbxml-stream library for something, please try the new version because it integrates a nice new feature (version management) and some minor improvements and bug fixes. Stay tuned for more news about the project.
Cheerio!