September 10, 2014

A Short History of Databases: From RDBMS to NoSQL & Beyond

More data than ever before is being created, distributed and harnessed to make business decisions. To give a sense of scale, IBM said in 2013 that 90% of the world’s data had been created in the previous two years alone.

In this blog post, we will look at the evolution of databases and at some of the reasons relational databases are becoming less common while NoSQL databases grow in popularity. We’ll cover the advantages and disadvantages of relational and NoSQL databases, the differences between the two, and the four types of NoSQL database in use.

The boom in unstructured data over the last few years is one of the main reasons relational databases are no longer sufficient for many companies’ needs. One cause of this boom is easy global access to the Internet. Another is the ubiquity of social media, where people share what is happening to them as it happens. With more than a fifth of the world’s population behaving this way, fast storage and retrieval become critically important, and so does capacity for many kinds of data: audio, video, images and text.

Traditionally, the IT world has depended on relational database management systems (RDBMS) for its storage needs. Enormous amounts of data are created every day by web and business applications, and a large share of this data is handled by relational databases. Beyond its other benefits, the relational model is well suited to client-server programming, and today it is the predominant technology for storing structured data in web and business applications. Classical relational databases follow the ACID properties: a database transaction must be Atomic, Consistent, Isolated and Durable. The details of ACID are as follows:

  • Atomicity: Each transaction is “all or nothing.” If one part of the transaction fails, the entire transaction fails and the database state is left unchanged. An atomic system must guarantee this in every situation.
  • Consistency: The database is in a consistent state before the transaction starts and after it ends, whether the transaction succeeds or not.
  • Isolation: Modifications made by one transaction must be independent of those made by any other transaction running at the same time.
  • Durability: Once the user has been notified of success, the transaction will persist and will not be undone.
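The “all or nothing” behavior described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the `accounts` table, the `transfer` helper and the sample balances are hypothetical and do not come from any particular system.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as well.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical sample data; the CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 20)")
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # debit fails, credit is undone
```

After the failed transfer, both balances are exactly what they were after the first, successful one: the partial credit to `bob` does not survive.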

Apart from the ACID properties, relational DBMSs became popular because of some basic characteristics:

  • Data is stored in a set of tables, each organized into rows and columns.
  • Relationships are represented by data (shared key values).
  • Tables are joined through these relational links.
  • Normalization reduces duplication of data in the database.
  • They allow greater flexibility and efficiency.
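As a small sketch of the characteristics above, the following uses Python's sqlite3 module with two hypothetical tables, `customers` and `orders`. The customer name is stored once (normalization), the relationship is represented by the `customer_id` value, and a join follows that link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each customer is stored once, and orders refer to
# customers through the customer_id column.
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES
        (10, 1, 'keyboard'),
        (11, 1, 'monitor'),
        (12, 2, 'mouse');
""")

# The join recombines the two tables by matching key values.
rows = conn.execute("""
    SELECT c.name, o.item
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()

for name, item in rows:
    print(name, item)
```

Renaming a customer means updating a single row in `customers`. The price is that every read of an order together with its customer name has to perform the join.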

Shortcomings of RDBMS

An RDBMS can store and manipulate structured data efficiently, but the volume and velocity of data generated over the Internet are growing exponentially, and its nature is changing. As we often see in areas like social media, the data has no fixed structure. This makes it unavoidable to handle unstructured data, which is non-relational and schema-less in nature. For an RDBMS it becomes a real challenge to provide fast, cost-effective Create, Read, Update and Delete (CRUD) operations, because it has to deal with the overhead of joins and of maintaining relationships among the data.
Therefore a new mechanism is required to deal with such data easily and efficiently. This is where NoSQL comes into the picture: it handles unstructured big data efficiently to provide maximum business value and customer satisfaction.

NoSQL

NoSQL is not a campaign against the SQL language. NoSQL stands for “Not Only SQL.” It gives developers more possibilities for data persistence beyond the classic relational approach.
NoSQL refers to a broad class of non-relational databases that differ from classical RDBMS in some significant aspects, most notably because they do not use SQL as their primary query language, instead providing access by means of Application Programming Interfaces (APIs).
The reasons behind such a big switch, or in other words the advantages of NoSQL, are the following:

  • High scalability
  • Distributed Computing
  • Lower cost
  • Schema flexibility
  • Support for unstructured and semi-structured data
  • No complex relationships

Whereas an RDBMS follows the ACID properties, NoSQL databases are “BASE” systems. The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem, whose properties BASE systems build on.

The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at the same time:

  • Consistency – once data is written, all future read requests will return that data
  • Availability – the database is always available and responsive
  • Partition tolerance – if one part of the database is unavailable or unreachable, the other parts keep working

Brewer originally described this impossibility result as forcing a choice of “two out of the three” CAP properties, leaving three viable design options: CP, AP and CA. The three combinations can be defined as:

  • CA – data should be consistent between all nodes. As long as all nodes are online, users can read/write from any node and be sure that the data is the same on all nodes.
  • CP – data is consistent between all nodes and maintains partition tolerance by becoming unavailable when a node goes down.
  • AP – nodes remain online even if they can’t communicate with each other and will re-sync data once the partition is resolved, but you aren’t guaranteed that all nodes will have the same data (either during or after the partition).
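The difference between the CP and AP choices can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real database implements replication: the `TinyCluster` class, its two in-memory replicas and the `partitioned` flag are all assumptions made for this example.

```python
class TinyCluster:
    """Two in-memory replicas with toy 'CP' or 'AP' behavior in a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False

    def write(self, replica, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for store in self.replicas:
                store[key] = value
            return
        if self.mode == "CP":
            # Stay consistent by refusing the write (giving up availability).
            raise RuntimeError("write refused: replicas cannot coordinate")
        # AP: accept the write locally and let the replicas diverge.
        self.replicas[replica][key] = value

    def read(self, replica, key):
        return self.replicas[replica].get(key)


def demo(mode):
    """Write, partition the cluster, write again, then read both replicas."""
    cluster = TinyCluster(mode)
    cluster.write(0, "x", 1)
    cluster.partitioned = True
    try:
        cluster.write(0, "x", 2)
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, cluster.read(0, "x"), cluster.read(1, "x")


cp_result = demo("CP")  # write refused, both replicas still agree
ap_result = demo("AP")  # write accepted, replica 1 now serves stale data
print(cp_result, ap_result)
```

In CP mode the second write is rejected, so both replicas keep the same value. In AP mode the write succeeds, but the replica on the other side of the partition keeps returning the old value until the two are re-synced.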

A BASE system gives up some consistency in order to have greater availability and partition tolerance. BASE can be defined as follows:

  • Basically Available indicates that the system does guarantee availability.
  • Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual consistency indicates that the system will become consistent over time, given that the system doesn’t receive input during that time.
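Eventual consistency can be sketched with two replicas that accept writes independently and later reconcile. The example assumes a simple last-write-wins rule based on a per-key version number. That rule, the `merge` function and the sample keys are illustrative assumptions; real systems use a variety of reconciliation strategies.

```python
def merge(local, remote):
    """Last-write-wins merge: per key, keep the pair with the higher version."""
    merged = dict(local)
    for key, (version, value) in remote.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, value)
    return merged


# Each replica maps key -> (version, value). Both start out identical.
replica_a = {"cart": (1, ["book"])}
replica_b = {"cart": (1, ["book"])}

# While the replicas cannot talk to each other, each accepts its own writes.
replica_a["cart"] = (2, ["book", "pen"])
replica_b["theme"] = (1, "dark")

diverged = replica_a != replica_b  # the system is temporarily inconsistent

# Once no new writes arrive, exchanging state in both directions makes
# the replicas converge.
replica_a = merge(replica_a, replica_b)
replica_b = merge(replica_b, replica_a)

converged = replica_a == replica_b
print(diverged, converged)
```

Between the writes and the merge, a reader could see different answers depending on which replica it asks, which is the “soft state” described above.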

Types of NoSQL

Today there are four main types of NoSQL database:

  • Key-Value: The main idea here is a hash table in which each unique key points to a particular item of data. The key-value model is the simplest and easiest to implement, but it is inefficient when you are only interested in querying or updating part of a value, among other disadvantages. Examples of key-value databases are Amazon SimpleDB and Oracle Berkeley DB (BDB).
  • Column Oriented: These were created to store and process very large amounts of data distributed over many machines. There are still keys, but they point to multiple columns. The columns are arranged by column family. Examples of column-oriented databases are Cassandra and HBase.
  • Document Store: The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON. Document databases are essentially the next level of key-value, allowing nested values associated with each key, and they support querying more efficiently. An example of a document store is MongoDB.
  • Graph Based: Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which, again, can scale across multiple machines.

NoSQL databases generally do not provide a high-level declarative query language like SQL, in order to avoid processing overhead. Rather, querying these databases is data-model specific.
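To make the four data models more concrete, here is a rough sketch that mimics each of them with plain Python structures. It mirrors only the shape of the data, not the API of any product named above; the keys, field names and helper functions are hypothetical.

```python
import json

# Key-value: an opaque value per key. Reading one field means fetching and
# decoding the whole value.
kv_store = {"user:42": '{"name": "Ada", "city": "London"}'}
city_from_kv = json.loads(kv_store["user:42"])["city"]

# Column oriented: a row key points to columns grouped into column families.
column_store = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2014-09-01"},
    }
}
city_from_columns = column_store["user:42"]["profile"]["city"]

# Document: nested, semi-structured values that can be queried by field.
document_store = {
    "user:42": {"name": "Ada", "address": {"city": "London"}, "tags": ["admin"]},
    "user:7": {"name": "Grace", "address": {"city": "New York"}},
}


def find_by_city(docs, city):
    """Return the keys of all documents whose nested address.city matches."""
    return sorted(
        key for key, doc in docs.items()
        if doc.get("address", {}).get("city") == city
    )


# Graph: nodes plus labeled edges, queried by following relationships.
graph_edges = [
    ("user:42", "FOLLOWS", "user:7"),
    ("user:7", "FOLLOWS", "user:42"),
    ("user:42", "LIKES", "post:1"),
]


def neighbours(edges, node, relation):
    """Return the nodes reachable from `node` over edges labeled `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


print(find_by_city(document_store, "London"))
print(neighbours(graph_edges, "user:42", "FOLLOWS"))
```

Each model is queried in its own way (key lookup, column family access, field match, edge traversal), which is what “data-model specific” querying means in practice.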

Now we can easily differentiate between NoSQL and RDBMS:

  • NoSQL avoids joins and relationships, while an RDBMS relies on relationships and expensive joins.
  • NoSQL has a much lower maintenance cost than an RDBMS.
  • NoSQL demands more from developers and database designers, while an RDBMS needs comparatively little.
  • NoSQL uses BASE, while an RDBMS uses ACID.
  • NoSQL systems typically favor AP, whereas an RDBMS provides CA.
