Saturday, July 30. 2016
wbxml-stream: Version 0.4.0
Today I present another entry about the wbxml-stream project. Although the project is neither very important nor widely used, I wanted to stabilize it and release a more or less complete version. For that reason version 0.3.0 implemented the missing features of the StAX API, and today's release offers more languages. My idea was to support the same languages that the C counterpart, libwbxml, manages. So the new version 0.4.0 provides many more languages (WML, WV CSP 1.2, SYNCML 1.0,...) and, as a side effect, covers an important new feature.
The first idea for this version was just to add more property files to support all the languages. Until now, wbxml-stream only supported the languages that were tested in the C counterpart (I mean, those with XML sample files to check the implementation). In theory no more features were going to be added but, as always, the WBXML protocol is a box of never-ending surprises. One of the new languages is WML (Wireless Markup Language), which was used long ago to present pages on old phones. It uses common WBXML features but adds one particular detail (only for WML). WML uses variables (it is a little programming language) and the variables can be of three types: escape, un-escape and no-escape (please don't ask me what each type means). A variable should be encoded into WBXML using EXT_I or EXT_T extensions. In general, the WBXML protocol defines three extensions: EXT_I, EXT_T and EXT. The definition in the specification is as follows:
extension = [switchPage] (( EXT_I termstr ) | ( EXT_T index ) | EXT)
switchPage = SWITCH_PAGE pageindex
pageindex = u_int8
termstr = charset-dependent string with termination (0x00)
index = mb_u_int32 // integer index into string table.
There is also a chapter dedicated to them (5.8.4.2. Global Extension Tokens). It says that global extensions are available for document-specific use (the language can use them for whatever it wants). Each type in turn has three versions (EXT_I_[012], EXT_T_[012] and EXT_[012]), each defined with a specific token number. The EXT_I extensions carry a string, the EXT_T ones a long number (intended to be used with the string table) and the EXT ones are just the token itself (no data associated). In summary, a language can use the global extensions for any purpose. In my humble opinion it is better not to use any extension (the same goes for opaques), because they add custom features that need to be implemented in every library (I do not understand why WBXML is so fucking extensible).
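To make the three families more concrete, here is a minimal sketch (not code from wbxml-stream; the class and method names are made up for the illustration). It lists the global extension token values from the WBXML specification and reads the mb_u_int32 that follows an EXT_T token: seven payload bits per byte, with the high bit set on every byte except the last one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, only for illustration (not part of wbxml-stream).
final class WbXmlExtensionTokens {
    // Global extension token values as listed in the WBXML specification.
    static final int EXT_I_0 = 0x40, EXT_I_1 = 0x41, EXT_I_2 = 0x42;
    static final int EXT_T_0 = 0x80, EXT_T_1 = 0x81, EXT_T_2 = 0x82;
    static final int EXT_0 = 0xC0, EXT_1 = 0xC1, EXT_2 = 0xC2;

    /** Returns "EXT_I", "EXT_T" or "EXT" for a global extension token, null otherwise. */
    static String family(int token) {
        switch (token) {
            case EXT_I_0: case EXT_I_1: case EXT_I_2:
                return "EXT_I";
            case EXT_T_0: case EXT_T_1: case EXT_T_2:
                return "EXT_T";
            case EXT_0: case EXT_1: case EXT_2:
                return "EXT";
            default:
                return null;
        }
    }

    /** Reads a multi-byte unsigned integer (mb_u_int32), most significant group first. */
    static long readMbUInt32(InputStream in) {
        long value = 0;
        for (int i = 0; i < 5; i++) { // 32 bits need at most five 7-bit groups
            int b;
            try {
                b = in.read();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (b < 0) {
                throw new IllegalStateException("Unexpected end of stream");
            }
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) { // continuation bit clear: last byte
                return value;
            }
        }
        throw new IllegalStateException("mb_u_int32 longer than 5 bytes");
    }
}
```

For example, the two bytes 0x81 0x20 decode to 160 (0xA0), so an EXT_T token followed by those bytes would carry the value 160.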
WML uses EXT_I and EXT_T extensions to encode the variable names. Whenever you find a variable (in WML they are written as $var or $(var), and you can specify the type using $(var:escape) for example) it should be transformed into an EXT_I or EXT_T. The variable name is put in the inline string (EXT_I) or in the string table (the EXT_T index marks the position in the table). Besides, the three versions of each extension (0, 1 and 2) are used to specify the type of the variable (escape, un-escape or no-escape). The idea behind that (I suppose) is to reduce the document length (using EXT_T is better for that purpose).
Previously the wbxml-stream library only handled EXT_T_0 extensions, because they were used by WV (Wireless Village). This language defines some tokens to turn arbitrary strings into EXT_T_0 extensions (for example "https://" is defined as the extension token 0x0f) in order to get shorter values and attributes. The C library libwbxml integrates this definition as a common WBXML feature (the extension tokens are defined in a general structure) and, because I copied a lot of ideas from the C implementation, I did exactly the same. Now it is clear that this is a specific feature of the WV-CSP language and not a general WBXML one. But I am going to leave things like this: any language can still define extension strings that will be replaced by EXT_T_0 tokens, and the encoding is done in the WbXmlEncoder core class.
What is different in the new version is that there are three new interfaces to define what the language does when an extension token is found. There are three because one interface is defined for each type of extension (EXT_I receives the string, EXT_T the long and EXT receives nothing).
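At the byte level, the two ways of writing a WML variable could look like the sketch below. It is only an illustration under some assumptions: the class is hypothetical, the name is encoded as UTF-8 (a real document uses its declared charset), and versions 0, 1 and 2 are mapped to escape, un-escape and no-escape in the order given above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the wbxml-stream encoder.
final class WmlVariableEncodingSketch {
    /** ordinal() gives the extension version: 0 = escape, 1 = un-escape, 2 = no-escape. */
    enum Kind { ESCAPE, UNESCAPE, NOESCAPE }

    /** Inline form: EXT_I_n token, the variable name, and the 0x00 terminator. */
    static byte[] inline(String name, Kind kind) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x40 + kind.ordinal()); // EXT_I_0 is 0x40
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8 document
        out.write(bytes, 0, bytes.length);
        out.write(0x00);
        return out.toByteArray();
    }

    /** String table form: EXT_T_n token followed by the table offset as mb_u_int32. */
    static byte[] tableReference(long offset, Kind kind) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x80 + kind.ordinal()); // EXT_T_0 is 0x80
        writeMbUInt32(out, offset);
        return out.toByteArray();
    }

    /** Writes 7 bits per byte, high bit set on every byte except the last. */
    static void writeMbUInt32(ByteArrayOutputStream out, long value) {
        int shift = 28;
        while (shift > 0 && (value >>> shift) == 0) {
            shift -= 7; // skip leading groups that are all zero
        }
        for (; shift > 0; shift -= 7) {
            out.write((int) ((value >>> shift) & 0x7F) | 0x80);
        }
        out.write((int) (value & 0x7F));
    }
}
```

With this sketch `inline("var", Kind.ESCAPE)` takes five bytes, while `tableReference(160, Kind.NOESCAPE)` takes three no matter how long the name stored in the string table is, which is where the saving comes from when a variable is repeated.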
public interface ExtensionIPlugin {
    public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext, String value) throws IOException;
    public String parseAttribute(WbXmlParser parser, String attrName, byte ext, String value) throws IOException;
}

public interface ExtensionTPlugin {
    public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext, long value) throws IOException;
    public String parseAttribute(WbXmlParser parser, String attrName, byte ext, long value) throws IOException;
}

public interface ExtensionPlugin {
    public WbXmlContent parseContent(WbXmlParser parser, String tagName, byte ext) throws IOException;
    public String parseAttribute(WbXmlParser parser, String attrName, byte ext) throws IOException;
}
I think the idea is more or less clear. For example, in WML if you find an EXT_I_0 token with the string "var", it should be replaced by a variable called var of type escape, that is "$(var:escape)". So the WML language adds a class that implements both ExtensionIPlugin and ExtensionTPlugin. Besides, there are new property names to specify the classes that will handle those extensions. For example, in the WML language the EXT_I and EXT_T options are used to mark the class that handles them:
wbxml.extension.EXT_I=es.rickyepoderi.wbxml.document.opaque.WMLVariableExtension
wbxml.extension.EXT_T=es.rickyepoderi.wbxml.document.opaque.WMLVariableExtension
When the wbxml-stream implementation finds an extension token and no class is defined to handle it, it just throws an exception.
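The parsing side can be pictured with the following simplified sketch. It does not use the real plugin interfaces (they need the parser classes of the library); `StringExtensionHandler`, `WmlVariables` and the dispatcher are hypothetical stand-ins, the map plays the role of the property file, and the conversion names "unesc" and "noesc" are my assumption of the WML spelling for un-escape and no-escape.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "look up the handler or fail", not the library code.
final class ExtensionDispatchSketch {
    /** Simplified stand-in for an EXT_I handler: token plus inline string in, text out. */
    interface StringExtensionHandler {
        String toText(int ext, String value);
    }

    /** Turns EXT_I_0/1/2 plus a variable name into WML variable syntax. */
    static final class WmlVariables implements StringExtensionHandler {
        // Assumption: WML spells the three conversions like this.
        private static final String[] TYPES = {"escape", "unesc", "noesc"};

        @Override
        public String toText(int ext, String value) {
            return "$(" + value + ":" + TYPES[ext - 0x40] + ")"; // EXT_I_0 is 0x40
        }
    }

    // Keyed by extension family, like the wbxml.extension.EXT_I property.
    private final Map<String, StringExtensionHandler> handlers = new HashMap<>();

    void register(String family, StringExtensionHandler handler) {
        handlers.put(family, handler);
    }

    String handleExtI(int ext, String value) {
        StringExtensionHandler handler = handlers.get("EXT_I");
        if (handler == null) {
            // No class configured for this family: refuse to guess.
            throw new IllegalStateException("No handler defined for EXT_I");
        }
        return handler.toText(ext, value);
    }
}
```

After `register("EXT_I", new WmlVariables())`, calling `handleExtI(0x40, "var")` returns "$(var:escape)"; without the registration the same call throws.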
Finally, with all the stuff shown before we know how to parse/decode a WBXML document that uses extensions but... How do we encode them? I mean, how is a variable in WML written using an extension instead of a normal string? In WV-CSP it is done in the core but... How to do it in general? I just recommend using an opaque plugin. I know it is quite strange, but it is easier and I do not want to complicate the API even more for the moment. An opaque plugin was initially meant to replace an attribute value or tag content with an opaque (both ways, parsing and encoding), but you can re-use the interface to encode extensions too (the encode method is used and the parse method will never be called because no opaque will be found). This way the WML plugin searches for variable strings and replaces the findings with EXT_I or EXT_T (depending on whether the use of the string table is defined). I know it is quite weird but I think it is the easiest way to manage it; maybe in future versions an API change will be needed to reorganize extensions and opaques.
So, just to summarize how to handle extensions in the wbxml-stream library, I am going to list what WML and WV-CSP do (the only two languages that manage extensions):
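The "search for variable strings" step can be sketched with a regular expression. This is not the WML plugin of the library, just a hypothetical scanner that splits a value into plain text and variables; the "$$" escape of WML is deliberately not handled, and "default" is a placeholder I use when no conversion is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split a WML value into text segments and variables.
final class WmlVariableScanner {
    // Matches $(name), $(name:conversion) or $name.
    private static final Pattern VAR =
            Pattern.compile("\\$(?:\\((\\w+)(?::(\\w+))?\\)|(\\w+))");

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = VAR.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                // Plain text before the variable stays a normal string.
                parts.add("TEXT:" + text.substring(last, m.start()));
            }
            String name = m.group(1) != null ? m.group(1) : m.group(3);
            String conversion = m.group(2) != null ? m.group(2) : "default";
            // An encoder would emit an EXT_I or EXT_T token here instead.
            parts.add("VAR:" + name + ":" + conversion);
            last = m.end();
        }
        if (last < text.length()) {
            parts.add("TEXT:" + text.substring(last));
        }
        return parts;
    }
}
```

For "Hello $(user:escape), you have $count mails" the scanner yields a text segment, the variable user with conversion escape, another text segment, the variable count and the trailing text. Each text segment would be written as a normal string and each variable as an extension.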
WML provides one class that implements ExtensionIPlugin and ExtensionTPlugin, and it is associated in the language definition. This way, when parsing a WBXML document, that class is called to handle any extension that is found. The class transforms the extension into a proper WML variable. But the encoding is done differently: two opaque plugins (attribute and content) are used instead, and those classes only perform the encoding part, searching for variables in the string and replacing them with extensions. The parse method is not needed because it will never be called (no opaque will be found, since none is written).
WV-CSP provides an ExtensionTPlugin (it looks up the corresponding string using the number in the EXT_T) but no opaque plugin. The library provides that feature (extension replacement using EXT_T_0) by default. As I explained before, I copied the libwbxml implementation when I had not completely understood WBXML extensions. I decided not to delete that feature (any language can use it) and, therefore, no opaque plugin is needed because the translation is done in the WbXmlEncoder class (but think of it as an exception; you will normally need an opaque plugin if you want to write extensions). For the same reason the ExtensionTPlugin that the WV language uses is called DefaultExtTExtension instead of having a WV-specific name.
And that's all. wbxml-stream 0.4.0 is supposed to be feature complete, at exactly the same level as its counterpart libwbxml. The new languages are not so well tested because I did not find XML sample documents for them. I tried to have at least one XML and WBXML document for every language, but in some cases that is clearly insufficient. For that reason the library keeps using 0.X.X versions. As usual, if you (for some strange circumstance) need a WBXML parser/encoder library in Java, please consider using the wbxml-stream project and report any issue on github.
Regards!
Monday, July 25. 2016
Moving to Nextcloud
Today I decided to upgrade from owncloud 8.2 to nextcloud 9.0. There are a lot of posts around the internet, so I am not going to give a detailed step-by-step guide (besides, it is similar to a migration to a newer version of owncloud, so my previous post is valid here too). If you remember, I had problems updating to the last version because of errors in the database upgrade. The error persisted until now, and in both applications (owncloud and nextcloud).
The upgrade complained about creating some tables. The underlying reason was that those tables were already there (I do not know where they came from, because they have been in my backups since version 7 provided by Debian). Finally I gave up and decided to spend the time deleting the affected tables and re-inserting the data if necessary. I went table by table, and finally the following SQL script drops the problematic tables:
/* drop database and re-create */
drop database owncloud;
create database owncloud;
use owncloud;
/* load the dump from version 8.2 */
source /home/ricky/owncloud-8.2.dump
/* drop tables that seem to be wrong */
drop table oc_notifications;
drop table oc_trusted_servers;
drop table oc_addressbooks;
drop table oc_cards;
drop table oc_addressbookchanges;
drop table oc_calendarobjects;
drop table oc_calendars;
drop table oc_calendarchanges;
drop table oc_calendarsubscriptions;
drop table oc_schedulingobjects;
drop table oc_cards_properties;
drop table oc_dav_shares;
Of those tables, five had previous data in the backup (oc_addressbooks, oc_cards, oc_addressbookchanges, oc_calendars and oc_cards_properties). Nevertheless, after the upgrade all of them were fully populated again (so I think the data is created from pre-existing information and no further action was needed). After deleting them the nextcloud upgrade ran successfully and now I am at version 9.0.53. As I commented in the previous entry, do not forget to activate the applications you use and upgrade them too. Just one more comment: for the notes application I use, the current master branch is needed (the last released version, 2.0.1, does not work; nothing is shown inside the page).
So, finally I decided to move from owncloud to nextcloud. I also changed the phone application. Let's see what happens because I am not very sure about my decision. (Mainly I wanted to fix my broken upgrade to version 9.x. And, because both projects gave me the same problem, I realized that the issue had to be related to my particular data and not to a problem in the upgrade itself. That fact forced me to choose a project sooner than I would have wanted. The answer from the owncloud foundation to the nextcloud announcement was quite decisive.)
Enjoy and retain your data!
Saturday, July 9. 2016
wbxml-stream: Version 0.3.0
After version 0.2.0 of the wbxml-stream library, I decided to continue working on the project to complete the StAX (Streaming API for XML) implementation for the WBXML format. StAX is quite complicated and not very clear in some parts (it has some flaws that make me think it was not created by Sun but by another member of the community; this is only my suspicion). Basically, for the new version I wanted to implement three changes.
How the ATTRIBUTE event should be used. The StAX framework, the XMLStreamReader more concretely, parses the XML file generating events (for example START_DOCUMENT when the document is just opened, START_ELEMENT when an XML tag is found or CHARACTERS when simple text is reached). Besides, there are methods associated with each event to retrieve its data (for example getText in CHARACTERS to retrieve the string value of the event, or getName in START_ELEMENT to know the name of the tag). The developer should iterate over the events returned by the reader and pick up whatever he/she wants. There is another reader, the XMLEventReader, which is mainly a wrapper of the previous one and returns event objects that combine both (the event and the methods for retrieving the data associated with it).
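For readers who have never used StAX, this is roughly what that iteration looks like. The sketch uses the XML parser bundled with the JDK instead of wbxml-stream, and the class name is made up; with the library only the factory would change.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration with the JDK parser; the class name is hypothetical.
final class StreamReaderLoopSketch {
    static List<String> describe(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask for adjacent text to be delivered as one CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        seen.add("START " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            seen.add("TEXT " + reader.getText().trim());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        seen.add("END " + reader.getLocalName());
                        break;
                    default:
                        break; // other events are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

The accessor that is valid depends on the current event: getLocalName only makes sense on element events and getText on text events, which is why the switch is the usual shape of this loop.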
One of those events defined in the API is ATTRIBUTE. In my first implementation, that event was returned just once for each tag that had one or more attributes (the methods to access the attributes are always indexed, so I thought that once was enough). But when I moved on to implement the event reader, I saw that the Attribute event class represents just one attribute (name and value). Something was wrong. At that point I decided not to return attribute events at all (the easiest solution at that time). It worked because the ATTRIBUTE event was not used in general by other parts of the JDK.
For version 0.3.0 I wanted to do it properly and I changed the reader to return as many events as there were attributes in the element. I thought that was the proper implementation and it also worked well. But then I realized that I was wrong and my first implementation was the good one. The javadoc for the Attribute event class says the following: "Attributes are reported as a set of events accessible from a StartElement. Other applications may report Attributes as first-order events, for example as the results of an XPath expression."
So (from what I understand) the ATTRIBUTE event is never returned by the readers; it is only an auxiliary event that should be accessed from the START_ELEMENT. So I reverted to my first implementation, but now believing it is the proper one.
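A small sketch of what "accessible from a StartElement" means in practice, again with the JDK parser and a hypothetical class name: the attributes are obtained from the StartElement event, not from separate ATTRIBUTE events in the iteration.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Illustration with the JDK parser; the class name is hypothetical.
final class AttributesFromStartElementSketch {
    static Map<String, String> attributesOf(String xml, String elementName) {
        Map<String, String> result = new TreeMap<>();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                StartElement start = event.asStartElement();
                if (!start.getName().getLocalPart().equals(elementName)) {
                    continue;
                }
                // Each Attribute object hangs from the StartElement.
                Iterator<?> attributes = start.getAttributes();
                while (attributes.hasNext()) {
                    Attribute attribute = (Attribute) attributes.next();
                    result.put(attribute.getName().getLocalPart(), attribute.getValue());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }
}
```

The `Iterator<?>` with a cast is used only because the return type of getAttributes changed from a raw Iterator to a generic one in newer JDKs.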
I also wanted to rewrite the WbXmlEventReader. In the previous versions I had implemented my own XMLEvent classes. But I later realized that there are two classes in the StAX API that are meant to be used in the event reader:
The XMLEventAllocator, which is basically an object that converts the current state of a stream reader into the corresponding XMLEvent.
The XMLEventFactory, which is a factory to construct events (and is used by the allocator).
So I could drop all my internal implementation and simply use an allocator to create the events. Another weird thing about the StAX interface is that you can obtain the default event factory (XMLEventFactory.newFactory method) but not the default allocator (there is no method to get the default allocator of an implementation). So I needed to implement my own allocator, copying the JDK default class.
Now the WbXmlEventReader uses a WbXmlEventAllocator to read the internal stream reader state and convert it into event objects. You can assign another allocator using the standard methods defined in the factory.
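To show how the allocator and the factory fit together, here is a reduced allocator written against the standard interfaces. It is not the WbXmlEventAllocator of the library nor the JDK default class: it only covers five event types, ignores namespaces, and the helper `eventTypes` exists just to exercise it with the JDK parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Namespace;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.XMLEventAllocator;
import javax.xml.stream.util.XMLEventConsumer;

// Reduced, hypothetical allocator: enough to show the idea, not complete.
final class SimpleEventAllocator implements XMLEventAllocator {
    private final XMLEventFactory events = XMLEventFactory.newInstance();

    @Override
    public XMLEventAllocator newInstance() {
        return new SimpleEventAllocator();
    }

    /** Builds an event object from the current state of the stream reader. */
    @Override
    public XMLEvent allocate(XMLStreamReader reader) throws XMLStreamException {
        switch (reader.getEventType()) {
            case XMLStreamConstants.START_DOCUMENT:
                return events.createStartDocument();
            case XMLStreamConstants.START_ELEMENT: {
                List<Attribute> attributes = new ArrayList<>();
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    attributes.add(events.createAttribute(
                            reader.getAttributeName(i), reader.getAttributeValue(i)));
                }
                return events.createStartElement(reader.getName(),
                        attributes.iterator(), Collections.<Namespace>emptyIterator());
            }
            case XMLStreamConstants.CHARACTERS:
                return events.createCharacters(reader.getText());
            case XMLStreamConstants.END_ELEMENT:
                return events.createEndElement(reader.getName(),
                        Collections.<Namespace>emptyIterator());
            case XMLStreamConstants.END_DOCUMENT:
                return events.createEndDocument();
            default:
                throw new XMLStreamException(
                        "Event type not covered by this sketch: " + reader.getEventType());
        }
    }

    @Override
    public void allocate(XMLStreamReader reader, XMLEventConsumer consumer)
            throws XMLStreamException {
        consumer.add(allocate(reader));
    }

    /** Walks a document with a stream reader and allocates one event per state. */
    static List<Integer> eventTypes(String xml) {
        try {
            XMLInputFactory inputs = XMLInputFactory.newInstance();
            inputs.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = inputs.createXMLStreamReader(new StringReader(xml));
            XMLEventAllocator allocator = new SimpleEventAllocator();
            List<Integer> types = new ArrayList<>();
            types.add(allocator.allocate(reader).getEventType());
            while (reader.hasNext()) {
                reader.next();
                types.add(allocator.allocate(reader).getEventType());
            }
            return types;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In the standard API an allocator like this one is plugged in through `XMLInputFactory.setEventAllocator`, and an event reader then calls `allocate` for each position of its internal stream reader.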
The main feature added in version 0.3.0 is filtered readers. The StAX API is unnecessarily complicated and it allows creating filters over the two types of readers (stream and event). The idea is simple: you can add a filter that only accepts some types of events and, this way, the methods that iterate over the events just skip all events not accepted by the filter (the methods affected should be next, hasNext and peek). The idea seems interesting but it is quite difficult to implement and (I think) not used very much. Why complicate an API like this? In the previous versions those filtered readers were simply not implemented.
With the goal of completing the library I decided a long time ago to implement those readers too. I used different approaches for the two types of readers:
The basic stream reader was only modified to allow saving the current state of the parsing (backup) and restoring it later (restore). This way you can save a point in the reader and move back to the same position later.
With that modification a wrapper class was written. The filtered stream reader just manages the next position using the backup/restore system (the next position is calculated as the following event that is accepted by the filter). I did not want to complicate the basic implementation any further, and this idea was simple and functional.
The event reader internally uses the basic stream reader. In this case I complicated the event reader implementation a bit (using the backup/restore methods again) to get the next position accepted by the filter. So, more or less, the event reader uses the same idea as the filtered stream reader (if it worked once...).
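The backup/restore idea can be reduced to a toy model. In the sketch below (entirely hypothetical, nothing here is the library code) the "parser state" is just an index into a list of event codes, and the filtered wrapper looks ahead for the next accepted event and then restores the saved position so that hasNext does not consume anything.

```java
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntPredicate;

// Toy model of look-ahead with backup/restore; all names are made up.
final class BackupRestoreFilterSketch {
    /** Minimal cursor: the position is the whole parsing state here. */
    static final class Cursor {
        private final List<Integer> events;
        private int pos = 0;   // index of the current event
        private int saved = 0; // position remembered by backup()

        Cursor(List<Integer> events) {
            this.events = events;
        }

        boolean hasNext() {
            return pos + 1 < events.size();
        }

        int next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return events.get(++pos);
        }

        int current() {
            return events.get(pos);
        }

        void backup() {
            saved = pos;
        }

        void restore() {
            pos = saved;
        }
    }

    /** Wrapper that only stops at events accepted by the filter. */
    static final class Filtered {
        private final Cursor cursor;
        private final IntPredicate filter;

        Filtered(Cursor cursor, IntPredicate filter) {
            this.cursor = cursor;
            this.filter = filter;
        }

        boolean hasNext() {
            cursor.backup(); // remember where we are
            try {
                while (cursor.hasNext()) {
                    if (filter.test(cursor.next())) {
                        return true;
                    }
                }
                return false;
            } finally {
                cursor.restore(); // looking ahead must not move the reader
            }
        }

        int next() {
            while (cursor.hasNext()) {
                int event = cursor.next();
                if (filter.test(event)) {
                    return event;
                }
            }
            throw new NoSuchElementException();
        }
    }
}
```

In a real parser the saved state is much more than an index (input position, open tags, current code page...), which is why keeping the change limited to two methods in the basic reader is attractive.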
Now the wbxml-stream library lets you use a filtered reader (stream or event) to read a WBXML document. Here I present one of the unit tests inside the library that checks the implementation with an SI document. Imagine you have to parse the following SI XML document, but in WBXML format:
<?xml version="1.0"?>
<!DOCTYPE si PUBLIC "-//WAPFORUM//DTD SI 1.0//EN" "http://www.wapforum.org/DTD/si.dtd">
<si>
  <indication href="http://www.xyz.com/email/123/abc.wml"
              created="1999-06-25T15:23:15Z"
              si-expires="1999-06-30T00:00:00Z">
    You have 4 new emails
  </indication>
</si>
You can think of a StreamFilter that only accepts START_ELEMENT events. Using a filter like that, the previous document could be read this way:
XMLInputFactory f = new WbXmlInputFactory();
XMLStreamReader reader = f.createXMLStreamReader(new FileInputStream("si-001.wbxml"));
StreamFilter filter = new StreamFilter() {
    @Override
    public boolean accept(XMLStreamReader reader) {
        return reader.getEventType() == XMLStreamConstants.START_ELEMENT;
    }
};
reader = f.createFilteredReader(reader, filter);
// it should be the start document
Assert.assertEquals(XMLStreamConstants.START_DOCUMENT, reader.getEventType());
// read next => it should be "si"
Assert.assertEquals(XMLStreamConstants.START_ELEMENT, reader.next());
Assert.assertEquals("si", reader.getName().getLocalPart());
Assert.assertEquals(0, reader.getAttributeCount());
// next it should be "indication"
Assert.assertEquals(XMLStreamConstants.START_ELEMENT, reader.next());
Assert.assertEquals("indication", reader.getName().getLocalPart());
Assert.assertEquals(3, reader.getAttributeCount());
Assert.assertEquals("1999-06-25T15:23:15Z", reader.getAttributeValue(null, "created"));
Assert.assertEquals("http://www.xyz.com/email/123/abc.wml", reader.getAttributeValue(null, "href"));
Assert.assertEquals("1999-06-30T00:00:00Z", reader.getAttributeValue(null, "si-expires"));
Assert.assertEquals("You have 4 new emails", reader.getElementText());
Assert.assertEquals(XMLStreamConstants.END_ELEMENT, reader.getEventType());
// no more start elements
Assert.assertFalse(reader.hasNext());
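The event flavour works in a similar way with an EventFilter. The following sketch uses the plain XML parser of the JDK rather than wbxml-stream (so the input is XML text, not WBXML), and the class name is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Illustration with the JDK parser; the class name is hypothetical.
final class FilteredEventReaderSketch {
    static List<String> startElementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLEventReader plain = factory.createXMLEventReader(new StringReader(xml));
            EventFilter onlyStarts = new EventFilter() {
                @Override
                public boolean accept(XMLEvent event) {
                    return event.isStartElement();
                }
            };
            // hasNext and nextEvent now skip everything the filter rejects.
            XMLEventReader filtered = factory.createFilteredReader(plain, onlyStarts);
            while (filtered.hasNext()) {
                names.add(filtered.nextEvent().asStartElement().getName().getLocalPart());
            }
            filtered.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

Because the filter guarantees that only start elements come out, the loop can call asStartElement directly without checking the event type.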
And that is all. The new version 0.3.0 of wbxml-stream adds filtered readers to the implementation as its main new feature. It also improves the WbXmlEventReader in several aspects and includes some fixes. If you are currently using this library, please try the new version, downloading it from here. As always, any problem or enhancement can be reported on github. My personal feeling about the StAX API is changing day by day. Now I think that this API is ridiculously complicated. It has a lot of ways of doing exactly the same thing with no great benefit (stream, event, filters,...), sometimes it is not very clear (the ATTRIBUTE event is an example) and it has some flaws (why can I not obtain the default XMLEventAllocator?). I do not know, maybe I am not understanding something, but with the XMLStreamReader class you have more than enough to read any XML document (it is also the fastest way); all the rest is just cumbersome stuff.
Regards!