Saturday, July 30. 2016
wbxml-stream: Version 0.4.0
Today I present another entry about the wbxml-stream project. Although the project is neither very important nor widely used, I wanted to stabilize it and release a more or less complete version. For that reason version 0.3.0 implemented the missing features of the StAX API, and today's release offers more languages. My idea was to support the same languages that the C counterpart libwbxml manages. So the new version 0.4.0 adds many more languages (WML, WV CSP 1.2, SYNCML 1.0,...) and, as a side effect, covers an important new feature.
The first idea for this version was just to add more property files to support all the languages. Until now, wbxml-stream only supported the languages that were tested in the C counterpart (I mean, there were XML sample files to test the correct implementation). In theory no more features were going to be added but, as always, the WBXML protocol is a box of never-ending surprises. One of the new languages is WML (Wireless Markup Language), which was used long ago to present pages on old phones. It uses common WBXML features but adds one particular detail (only for WML). WML uses variables (it is a little programming language) and the variables can be of three types: escape, un-escape and no-escape (please don't ask me what each type means). A variable should be encoded into WBXML using the EXT_I or EXT_T extensions. In general, the WBXML protocol defines three extensions: EXT_I, EXT_T and EXT. The definition in the specification is as follows:
extension = [switchPage] (( EXT_I termstr ) | ( EXT_T index ) | EXT)
switchPage = SWITCH_PAGE pageindex
pageindex = u_int8
termstr = charset-dependent string with termination (0x00)
index = mb_u_int32 // integer index into string table.
And there is a chapter dedicated to them (5.8.4.2. Global Extension Tokens). It says that global extensions are available for document-specific use (the language can use them for whatever it wants). Each type has in turn three versions (EXT_I_[012], EXT_T_[012] and EXT_[012]), each defined with a specific token number. The EXT_I extensions carry an associated string, the EXT_T ones a long number (meant to be used with the string table) and the EXT ones are just the token itself (no associated data). In summary, given these characteristics, a language can use the global extensions for any purpose. In my humble opinion it is better not to use any extension (the same goes for opaques), because they add custom features that need to be implemented in every library (I do not understand why WBXML is so fucking extensible).
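To make the three shapes concrete, here is a minimal sketch of how a reader could decode one global extension item. It is not the wbxml-stream code: the class and method names are made up for illustration, the optional switchPage prefix is left out, and the string is assumed to be UTF-8 (in a real document it depends on the declared charset). The token numbers (0x40-0x42 for EXT_I, 0x80-0x82 for EXT_T, 0xC0-0xC2 for EXT) are the global token values of the WBXML specification.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of wbxml-stream.
final class GlobalExtensionSketch {

    // mb_u_int32: 7 data bits per byte, the high bit marks "more bytes follow".
    static long readMbUInt32(ByteBuffer in) {
        long value = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            value = (value << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return value;
    }

    // termstr: bytes up to (not including) the 0x00 terminator.
    // Assumption: the document charset is UTF-8.
    static String readTermStr(ByteBuffer in) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int b = in.get(); b != 0; b = in.get()) {
            buf.write(b);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Reads one extension item and returns a readable description of it.
    static String describe(ByteBuffer in) {
        int token = in.get() & 0xFF;
        int version = token & 0x03;
        if (version > 2) {
            // 0x43, 0x83 and 0xC3 are PI, STR_T and OPAQUE, not extensions.
            throw new IllegalArgumentException("not a global extension token");
        }
        switch (token & 0xFC) {
            case 0x40: return "EXT_I_" + version + " " + readTermStr(in);
            case 0x80: return "EXT_T_" + version + " " + readMbUInt32(in);
            case 0xC0: return "EXT_" + version;
            default:
                throw new IllegalArgumentException("not a global extension token");
        }
    }
}
```

For example, the bytes 0x41 'v' 'a' 'r' 0x00 are read as EXT_I_1 with the string "var", and 0x80 0x81 0x00 as EXT_T_0 with the number 128.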
WML uses the EXT_I and EXT_T extensions to encode the variable names. Whenever you find a variable (in WML they are written in the style of $var or $(var), and you can specify the type using $(var:escape) for example) it should be transformed into an EXT_I or EXT_T. The variable name is put in the inline string (EXT_I) or in the string table (the EXT_T index marks the position in the table). In addition, the three different versions of each extension (0, 1 and 2) are used to specify the type of the variable (escape, un-escape or no-escape). The idea behind that (I suppose) is to save document length (using EXT_T is better for that purpose).
Previously the wbxml-stream library only managed EXT_T_0 extensions, because they were used by WV (Wireless Village). This language defines some tokens to change arbitrary strings into EXT_T_0 extensions (for example "https://" is defined as the extension token 0x0f) in order to get shorter values and attributes. The C library libwbxml integrates this definition as a common WBXML feature (the extension tokens are defined in a general structure), so, because I copied a lot of ideas from the C implementation, I did exactly the same. Now it is clear that this is a specific feature of the WV-CSP language and not a general WBXML one. But I am going to leave things like this: any language can still define extension strings that will be replaced by EXT_T_0 tokens, and the encoding is done in the WbXmlEncoder core class.
What is different now, in the new version, is that there are three new interfaces to define what the language does when an extension token is found. There are three because one interface is defined for each type of extension (EXT_I receives the string, EXT_T the long and EXT receives nothing).
public interface ExtensionIPlugin {
    public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext, String value) throws IOException;
    public String parseAttribute(WbXmlParser parser, String attrName, byte ext, String value) throws IOException;
}

public interface ExtensionTPlugin {
    public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext, long value) throws IOException;
    public String parseAttribute(WbXmlParser parser, String attrName, byte ext, long value) throws IOException;
}

public interface ExtensionPlugin {
    public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext) throws IOException;
    public String parseAttribute(WbXmlParser parser, String attrName, byte ext) throws IOException;
}
I think the idea is more or less clear. For example, in WML if you find an EXT_I_0 token with the string "var", it should be replaced by a variable called var of type escape, that is "$(var:escape)". So the WML language adds a class that implements both ExtensionIPlugin and ExtensionTPlugin. In addition, there are new property names to specify the classes that will handle those extensions. For example, in the WML language the EXT_I and EXT_T options are used to mark the class that handles them:
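The mapping from an extension to a WML variable can be sketched in a few lines. This is only an illustration and not the WMLVariableExtension class of the library: it does not use the plugin interfaces, the names are invented, and it assumes that the method receives the raw token and that the string table is UTF-8. The suffixes "unesc" and "noesc" for versions 1 and 2 are taken from the WML variable syntax as I remember it, so treat them as an assumption too.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the library's WML plugin.
final class WmlVariableSketch {

    // The two low bits of the token give the version: 0, 1 or 2.
    static String conversion(int token) {
        switch (token & 0x03) {
            case 0: return "escape";
            case 1: return "unesc";   // assumed spelling
            case 2: return "noesc";   // assumed spelling
            default:
                throw new IllegalArgumentException("not an extension token");
        }
    }

    // EXT_I: the variable name comes inline with the token.
    static String fromExtI(int token, String name) {
        return "$(" + name + ":" + conversion(token) + ")";
    }

    // EXT_T: the number is a byte offset into the string table,
    // where the name is stored as a 0x00-terminated string.
    static String fromExtT(int token, long offset, byte[] stringTable) {
        int start = (int) offset;
        int end = start;
        while (end < stringTable.length && stringTable[end] != 0) {
            end++;
        }
        String name = new String(stringTable, start, end - start,
                StandardCharsets.UTF_8);
        return fromExtI(token, name);
    }
}
```

With a string table containing "foo" and "var" (each followed by 0x00), the offset 4 points at "var", so EXT_T_1 with index 4 gives "$(var:unesc)".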
wbxml.extension.EXT_I=es.rickyepoderi.wbxml.document.opaque.WMLVariableExtension
wbxml.extension.EXT_T=es.rickyepoderi.wbxml.document.opaque.WMLVariableExtension
When the wbxml-stream implementation finds an extension token and no class is defined to handle it, it just throws an exception.
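The lookup described above can be pictured with plain java.util.Properties. This is just a sketch under assumptions, not how the library really loads its language definitions: the class, the method names and the exception type are invented, and the "EXT" key in the usage note is only a guess by analogy with the two keys shown before.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of wbxml-stream.
final class ExtensionLookupSketch {

    // Parses a language definition given as properties text.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the handler class name for an extension type ("EXT_I",
    // "EXT_T", ...) or fails when the language defines none.
    static String handlerFor(Properties props, String type) {
        String className = props.getProperty("wbxml.extension." + type);
        if (className == null) {
            throw new IllegalStateException(
                    "no handler defined for extension " + type);
        }
        return className;
    }
}
```

With the two WML lines above, handlerFor(props, "EXT_I") returns the WMLVariableExtension class name, while asking for a type without an entry (for instance "EXT") ends in the exception.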
Finally, with all the stuff shown before we know how to parse/decode a WBXML document that uses extensions but... How do we encode them? I mean, how is a variable in WML written using an extension instead of a normal string? In WV-CSP it is done in the core but... How to do it normally? I just recommend using an opaque plugin. I know it is quite strange, but it is easier and I do not want to complicate the API even more for the moment. An opaque plugin was initially meant to replace any attribute value or tag element with an opaque (both ways, parsing and encoding), but you can re-use the interface to encode extensions too (the encode method is used and the parse method will never be called because no opaque will be found). This way the WML plugin searches for variable strings and replaces the findings with EXT_I or EXT_T (depending on the string table usage that is defined). I know it is quite weird, but I think it is the easiest way to manage it; maybe in future versions an API change will be needed to reorganize extensions and opaques.
So, just to summarize how to handle extensions in the wbxml-stream library, I am going to list what WML and WV-CSP do (the only two languages that manage extensions):
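The search-and-replace step of the encoding side could look more or less like the following sketch. Again, this is not the opaque plugin of the library: it only splits a string into readable items instead of writing bytes, it always uses EXT_I (never the string table), it ignores the $$ escape of WML, and the names are invented. Mapping a variable without an explicit type to version 2, and the spellings "unesc" and "noesc", are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the library's WML opaque plugin.
final class WmlVariableEncoderSketch {

    // Matches $(name), $(name:conv) and the short form $name.
    private static final Pattern VARIABLE =
            Pattern.compile("\\$\\((\\w+)(?::(\\w+))?\\)|\\$(\\w+)");

    static int version(String conv) {
        if (conv == null) {
            return 2; // assumption: no explicit type means no-escape
        }
        switch (conv) {
            case "escape": return 0;
            case "unesc":  return 1;
            case "noesc":  return 2;
            default:
                throw new IllegalArgumentException("unknown type: " + conv);
        }
    }

    // Splits the text into plain strings and EXT_I items, in order.
    static List<String> split(String text) {
        List<String> items = new ArrayList<>();
        Matcher m = VARIABLE.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                items.add("STR_I(" + text.substring(last, m.start()) + ")");
            }
            String name = m.group(1) != null ? m.group(1) : m.group(3);
            items.add("EXT_I_" + version(m.group(2)) + "(" + name + ")");
            last = m.end();
        }
        if (last < text.length()) {
            items.add("STR_I(" + text.substring(last) + ")");
        }
        return items;
    }
}
```

For the text "Hello $(name:escape)!" the result is three items: STR_I(Hello ), EXT_I_0(name) and STR_I(!). A real encoder would then write each item with its token, and would choose EXT_T plus a string table offset when the string table is in use.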
WML provides one class that implements ExtensionIPlugin and ExtensionTPlugin, and it is associated in the language definition. This way, when parsing a WBXML document, that class is called to handle any extension that is found. The class transforms the extension into a proper WML variable. But the encoding is done differently: two opaque plugins (attribute and content) are used instead, and these classes only perform the encoding part, searching for variables in the string and replacing them with extensions. The parse method is not needed because it will never be called (no opaque will be found, because none is written).
WV-CSP provides an ExtensionTPlugin (it searches for the corresponding string using the number in the EXT_T) but no opaque plugin. The library gives that feature (extension replacement using EXT_T_0) by default. As I explained before, I copied the libwbxml implementation when I had not completely understood WBXML extensions. I decided not to delete that feature (any language can use it) and, therefore, no opaque plugin is needed because the translation is done in the WbXmlEncoder class (but think of it as an exception, you will normally need an opaque plugin if you want to write extensions). For the same reason the ExtensionTPlugin that the WV language uses is called DefaultExtTExtension instead of using a specific WV name.
And that's all. The wbxml-stream 0.4.0 is supposed to be feature complete, exactly at the same level as its counterpart libwbxml. The new languages are not so well tested because I did not find XML sample documents for them. I tried to have, at least, one XML and WBXML document for every language, but in some cases that is clearly insufficient. For that reason the library keeps using 0.X.X versions. As usual, if you (for some strange circumstance) need to use a WBXML parser/encoder library in Java, please consider using the wbxml-stream project and report any issue on GitHub.
Regards!