Introduction to Apache Cassandra and support within WSO2 Platform

I did the following presentation at the WSO2 Data workshop last week at San Jose. Slides discuss integration with WSO2 Platform and Apache Cassandra. However, I think, it is more useful as an introduction to Cassandra. Here I am trying to discusses their column based model and explain how different it is from Columns in the Relational model. We (WSO2) have been working on integrating Cassandra with our platform. What we mean by integrating is that 

  1. We make WSO2 security model with Cassandra. Or simply put, anyone can login to Cassandra using username and password issued by WSO2 Identity Server (IS) and access Cassandra resources (key spaces, column families etc.) according to the permissions defined in WSO2 IS). 
  2. We support Multi-tenancy.  That is anyone can come to WSO2 Stratos and have their own Cassandra server and use that. You can say we offer Cassandra as a service. 

You can find the presentation from Introduction to Apache Cassandra and support within WSO2 Platform

A Multi-tenant Architecture for Business Process Executions at ICWS 2011

Following is the slides for the presentation A Multi-tenant Architecture for Business Process Executions, which I did at ICWS (International Conference of Web Services) 2011. In this work, we have extended Apache ODE project to to support Multi-tenancy.

Multi-tenancy enables a single server to support multiple tenants (users or organizations ) while providing the illusion that each tenant has its own server. It is one of the enabling technology for the Cloud, and Multi-tenancy based Cloud platforms can provides much higher resource utilization compared to virtualization based Cloud platforms.

This is the third paper on Multi-tenancy work we have been doing for more than 2 years. Details about other two can be found in here.

This work is available with WSO2 BPS, and also it is one of the key enabling technology for WSO2 Stratos Platform as a Service offering.

Cassandra: Surprises you Might Get

Now do not get me wrong, Cassandra is a great tool, and we use it. Following are few things that you might assume to be the case and only figure out later after you have been using it for some time. So spelling them out in case it is useful for someone else as well. 

  1. No transactions, no JOINs. Hope there is no surprise here. If it does, go and read bit about NoSQL before touching Cassandra. 
  2. No foreign keys and keys are immutable. (well no JOINs, and use surrogate keys if you need to change keys). You can change the keys, but any reference will not change, and there are no foreign key based integrity checks etc. 
  3. Keys has to be unique (use composite keys to work around this one). 
  4. Super Columns and order preserving partitioner are discouraged. – Developers repeats that life will be much easier without them. 
  5. Searching is complicated – No Search coming from the core. Either you have to use secondary indexes or create indexes yourself. Secondary indexes are layered on top, and not part of the main architecture. Also, they do not do range search or pattern search, which mean they good enough for extract string retrievals. So they are not good enough for , = etc. (only =) and SQL LIKE searches.    When secondary indexes does not work for what you need, you have to learn and build your indexes using sort orders and slices. 
  6. Sort orders are complicated – Column are always sorted by name, but row order depends on the partitioner. If you use sort orders to build your indexes, you have to worry about this. 
  7. Failed Operations may  leave changes – If operation is successful, all is well. However, if it failed, actually changes may have been applied. But operations are idempotent, so you can retry until successful. 
  8. Batch operations are not atomic, but you can retry until successful (as operations are idempotent). 
  9. If a node fails, Cassandra does not figure it out and and do a self haling. Assuming you have replica, things will continue to work. But the whole system recovers only when a manual recovery operation is done. 
  10. It remember deletes – When we delete a data item, a node may be down at the time and may come back after the delete is done. To avoid this, Cassandra mark them as deleted (Tombstones) but does not delete this until configurable timeout or a repair. Space is actually freed up only then. 

Netflix architecture talk at CasandraSF 2011

The talk is done by Adrian cockcroft.

Netflix has about 23 million subscribers and gets about 20B requests per month. Runs almost completely  in the cloud. Following are some reasons for using the Cloud.

  • Frictionless deployment  – do not need to order and wait. Remove many types of waits. 
  • Better business agility – that is can run the whole architecture for new zone within few hours. So they can act very fast. 
  • Cannot build data center fast enough to cater for the demand – data center takes at least 9 months to build. 
  • Support for  Zones, scale on demand, global deployment 

They keep movies separate, and distribute them through a CDN. Keep most of dynamic (transaction) data in Cassandra. Do not use EBS, but they back up data in Cassandra nodes periodically. Take independent backups at each Cassandra node, with loose synchronization. If there is any inconsistencies across different node backups, Cassandra repairs it at start up. They also  incrementally backup data to S3 to avoid losing data between backups. (presumably through callbacks that are called when SSTables in Cassandra are updated.)

They can point their system to a S3 bucket and bootup the system, and initializer will load and initialize the system using the data in that S3 bucket. They use Chaos monkey testing. In other words, a process goes and randomly kills nodes, but the system recovers. They do this often to make sure that recovery does work.