June 3, 2013

Big Data and CQL: Working With Semi-Structured Data in Cassandra


Looking back at databases’ technology evolution, we realize that SQL did such a good job in the industry that the mindset of application developers and technical managers is so locked in this paradigm that now it is quite hard to adopt “alternative” technologies. There is a tendency to impose reliable databases (“oldies but goodies”) used by the existing systems when a new software development project is coming along without considering the new context and challenges.

Let’s see why the “one size fits all” relational database choice is not easy to break in the persistence space:

  • Lack of standards in the NoSQL space. For so long there was just one query language and one way of modeling data. Now there are many flavours establishing their own standards
  • Documentation and tools in NoSQL need more polishing

However, let’s not forget that SQL has serious limitations when dealing with huge amounts of data, so here is the question: “How do you handle management of semi-structured data that keeps increasing in volume?”

A Semistructured Data Model

By semi-structured data, we understand here a relatively small number of stable entities that contain variable attributes (except the rowkey, which is mandatory). There are no strong relations among entities and no hierarchies either. In our case there are two basic entities: Event and EventProcessingResult, and there is One-To-One relationship based on eventId column but this relationship is not enforced so we might have entities on one side or the other without correspondent.

Figure 1 : Events Semi-Structured Data Model

Events Semi-Structured Data Model

One of the answers to the above question are columnar NoSQL databases like Cassandra and HBase. Both are mature and stable despite the fact they evolved from different philosophies (Cassandra was inspired from Amazon Dynamo, while HBase was inspired from Google BigTable).  Nevertheless Cassandra seems to have some more momentum these days, so let’s see what Cassandra brings to the table and how it responds to our question…

Cassandra Query Language is the equivalent of SQL with a very similar syntax and behavior for data definition and data manipulation. CQL can be embedded in code or used in command line so we’ll explore its capabilities in terms of basic data manipulation operations using cqlsh command.

Inspecting Existing Keyspaces

A keyspace is the equivalent of a database in relational databases.

cqlsh> describe keyspaces;

Inspecting a Keyspace (column families, data types, indexes)

A column family is the equivalent of tables in relational databases.

cqlsh> describe EventsSpace;
CREATE KEYSPACE "EventsSpace" WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
USE "EventsSpace";
  key bigint PRIMARY KEY,
  "categoryId" bigint,
  "customerId" bigint,
   description text,
  "eventId" bigint,
  “occuredDate” timestamp,
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

Creating Indexes on Column Families

Cassandra manages two types of indexes : the primary index which is applied automatically on the row key (primary key), and the secondary indexes for the rest of columns which need to be added explicitly. It is worth mentioning that secondary indexes are efficient when the values range is significantly smaller than the row numbers so there is a good rate of data redundancy. Otherwise, for cases when the column values are almost all different it is recommended to create and maintain a separate index table.

cqlsh>create index Events_customerId_idx ON "Events" ("customerId");
cqlsh>create index Events_eventId_idx ON "Events" ("eventId");

Manipulating Data (select, insert, update, delete)

When executing selects Cassandra limits by default the results to 10.000 rows so if you need to see more or the entire result a limit must be added explicitly:

cqlsh> use EventsSpace;
cqlsh:EventsSpace> select count(*) from Events;
cqlsh:EventsSpace> select count(*) from Events limit 1000000;
cqlsh:EventsSpace> select categoryId from Events where customerId=1;

Insert and update operations have the same behavior, inserting with the same row key in this case eventId, will not add another record but will make changes on existing one :

cqlsh:EventsSpace> insert into Events (key, 12345) values ('myCategoryId', 'newCustomerId','description','newEventId');
cqlsh:EventsSpace> update Events set customerId = 7, description = ‘desc’ where key = ‘222’;

There is possible to apply changes in batch, still the atomicity of the batch execution is not guaranteed, the level of atomicity is ensured only at the row level – an operation on a row might either succeed or fail:

cqlsh:EventsSpace>begin batch
cqlsh:EventsSpace> insert into Events (key, 12346) values ('myCategoryId1', 'newCustomerId1','description','newEventId1');
cqlsh:EventsSpace> insert into Events (key, 12347) values ('myCategoryId2', 'newCustomerId2','description','newEventId2');
cqlsh:EventsSpace> insert into Events (key, 12348) values ('myCategoryId3', 'newCustomerId3','description','newEventId'3);
cqlsh:EventsSpace>apply batch

Delete works bidimensional by row and by column, if there are no columns specified the entire row will be deleted :
cqlsh: EventsSpace> delete description, categoryId from Events where key = ‘221’;
cqlsh: EventsSpace> delete description from Events where key = ‘222’;
cqlsh: EventsSpace> delete from Events where key = ‘224’;

NoSQL columnars are evolving their schema management capabilities by adopting most of the good ideas from the SQL space (even though in their own ways). Cassandra is no exception as CQL is gaining maturity and becoming easier to use and learn for someone who has relational background. Also documentation, tooling is getting better fast, if we look at the companies supporting NoSQL databases like DataStax.