The last few entries in this blog have dealt with my new idea, the couchbase-manager, whose first version was released some days ago. I think the current status is good enough for a first testing release; the main problem is that I have only tested it with a few real applications, and that makes me extremely cautious about its quality. So last weekend I did another test I was eager to try: running the web services test application against a three-node Couchbase cluster with auto-failover.
The demo environment is very similar to the one described in the full HA GlassFish setup entry:
My wheezy laptop plus two KVM Debian testing boxes. Each virtual box contains an Apache with the mod_jk module and a GlassFish 3.1.2 installation. The GlassFish setup consists of the admin server and two clustered instances (one in each virtual machine). The Apache in the first box balances the two cluster instances non-sticky; the one in the second box does the same but sticky.
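Just as a reference, the non-sticky balancer in mod_jk boils down to a workers.properties similar to the following sketch (the worker names, hosts and ports are placeholders of mine, not the exact files of that entry); the sticky Apache only needs to flip sticky_session to true:

worker.list=balancer
worker.node1.type=ajp13
worker.node1.host=192.168.122.21
worker.node1.port=8009
worker.node2.type=ajp13
worker.node2.host=192.168.122.22
worker.node2.port=8009
worker.balancer.type=lb
worker.balancer.balance_workers=node1,node2
# non-sticky in the first box; the second one uses sticky_session=true
worker.balancer.sticky_session=false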
The three-node Couchbase cluster is also set up: one server in each virtual wheezy box and the remaining one on my laptop. A two-node cluster does not seem to be sufficient to activate the auto-failover feature, which is precisely what I want to test. The version I am using is 2.0.0 developer preview 4 (locking operations did not work in 1.8.0).
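By the way, auto-failover is disabled by default; it can be enabled from the console or, if I remember correctly, with a couchbase-cli command along these lines (host and credentials are obviously placeholders, and 30 seconds is the minimum timeout allowed):

$ couchbase-cli setting-autofailover -c 192.168.122.1:8091 \
-u Administrator -p password \
--enable-auto-failover=1 --auto-failover-timeout=30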
The manager-test application is deployed inside the GlassFish cluster, using the couchbase-manager in non-sticky configuration. The manager is configured with the three Couchbase servers.
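Internally the manager talks to the cluster through the couchbase/spymemcached client, which bootstraps from the configured list of nodes and then learns the real topology by itself. Something along the lines of this little sketch (bucket name, password and addresses are just examples of mine, not the manager's actual code):

import com.couchbase.client.CouchbaseClient;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class ConnectSketch {
    public static void main(String[] args) throws Exception {
        // one URI per node: the two virtual boxes and my laptop (example addresses)
        List<URI> nodes = Arrays.asList(
                URI.create("http://192.168.122.1:8091/pools"),
                URI.create("http://192.168.122.21:8091/pools"),
                URI.create("http://192.168.122.22:8091/pools"));
        // the client fetches the cluster topology from any reachable node,
        // which is why it can survive the failover of a single server
        CouchbaseClient client = new CouchbaseClient(nodes, "default", "");
        client.set("session-test", 0, "hello").get();
        System.out.println(client.get("session-test"));
        client.shutdown();
    }
}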
So the idea is simple. I run a long multi-threaded test that uses some sessions and, while everything is running, the Couchbase server installed on my laptop is stopped. In theory some of the client threads will begin to fail (the ones that were using the crashed server) but, within 30 seconds, Couchbase will fail over the downed node and re-balance the data across the two remaining servers. All the clients should then recover their sessions and run again without any problem. Let's see the video.
As you can see, the Couchbase console shows a three-server cluster. The client program is then executed with the following options:
$ time java -cp . es.rickyepoderi.managertest.client.Test \
-b "http://192.168.122.21/manager-test/SessionTest?wsdl" \
-t 8 -ci 100 -ts 2000 -d
That means that eight threads (clients) will each create a session and perform an update or a refresh every two seconds, a hundred times; at the end the session is invalidated (that gives me 200 seconds to do the video).

Once all the threads are performing their operations, the Couchbase server deployed on my laptop is stopped. At that moment some clients start to fail (child threads 18 and 19, for example). The console marks the server as down and the 30-second auto-failover timeout is triggered. While the affected clients keep failing, the failover of the crashed server is automatically performed (it is marked in the console with the alarm icon). From that moment the failed clients recover their sessions from another server and continue smoothly (the session is not created again, the same session is recovered!). The done! button in the console is clicked to acknowledge the situation and then the downed server is re-started. The console immediately reports the new situation and, after clicking the add back button manually, the cluster is re-established. Once the three nodes are back in the cluster, the sessions are manually re-balanced across the new cluster topology (rebalance button). At this point the only weird thing happens: some clients fail again (it seems that some errors can occur during the re-balancing process). The exception given by the manager is the following:
java.lang.IllegalStateException: EXCEPTION
at es.rickyepoderi.couchbasemanager.session.CouchbaseManager.doSessionLoad(CouchbaseManager.java:715)
at es.rickyepoderi.couchbasemanager.session.CouchbaseWrapperSession.doLoad(CouchbaseWrapperSession.java:383)
at es.rickyepoderi.couchbasemanager.session.CouchbaseWrapperSession.lockForeground(CouchbaseWrapperSession.java:478)
at org.apache.catalina.connector.Request.lockSession(Request.java:4165)
at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:312)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:231)
at com.sun.enterprise.v3.services.impl.ContainerMapper$AdapterCallable.call(ContainerMapper.java:317)
at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:195)
at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:849)
at com.sun.grizzly.http.ajp.AjpProcessorTask.invokeAdapter(AjpProcessorTask.java:125)
at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:746)
at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1045)
at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:228)
at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancelled
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:103)
at es.rickyepoderi.couchbasemanager.couchbase.ClientResult.createClientResultObject(ClientResult.java:103)
at es.rickyepoderi.couchbasemanager.couchbase.ClientRequest.waitForCompletion(ClientRequest.java:259)
at es.rickyepoderi.couchbasemanager.session.CouchbaseManager.doSessionLoad(CouchbaseManager.java:696)
... 22 more
Caused by: java.lang.RuntimeException: Cancelled
... 26 more
So the operation was cancelled. Maybe some retries would be necessary (in case of a cancellation), but that kind of logic complicates the implementation so much that, for the moment, I am not going to do it (besides, the retry could also be performed by the couchbase/spymemcached client itself). Anyway, it can be said that it is working and, more importantly, it is working well.
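For the record, if I ever added that retry, it would probably be a small wrapper around the locking read, something like this sketch (getAndLockWithRetry is a hypothetical helper of mine, not a method of the manager; getAndLock is the locking operation provided by the client):

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.CASValue;

public class RetrySketch {

    // hypothetical helper: retry a locking read when spymemcached
    // cancels the operation during a re-balance (assumes maxRetries >= 1)
    static CASValue<Object> getAndLockWithRetry(CouchbaseClient client,
            String id, int lockSecs, int maxRetries)
            throws InterruptedException {
        RuntimeException last = null;
        for (int i = 0; i < maxRetries; i++) {
            try {
                return client.getAndLock(id, lockSecs);
            } catch (RuntimeException e) {
                // a cancelled operation surfaces as a RuntimeException
                // whose cause chain ends in "Cancelled" (see the trace above)
                last = e;
                Thread.sleep(100L << i); // simple exponential backoff
            }
        }
        throw last;
    }
}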
I love it when a plan comes together!