Relational and Non-relational databases

In old days, when we used to think of 'database', we would only picture tables and columns, primary and foreign keys, complex joins, intersection tables etc. We often use ER diagrams, relationships to design and define such models. These databases are widely used and are stable. They adhere to ACID principles, provide immediate consistency which are most important when it comes to transaction handling. The most widely used products are Oracle, Teradata, SAP Hana, IBM db2, MySQL, Amazon Aurora, Maria db (comes by default on linux distributions), postgreSQL.
Then why is there a need of non-relational database?

The reason is Data itself. Data is now considered most valuable resource, even more than Oil!! No wonder that social networking site like Facebook and twitter actually run the world. They decide the trends, what people read, what they think. All from the power of Data. Anyways that’s a topic for another post. Data from web, genomics data, data from IoT devices, it’s just vast. Companies also want to utilize external data like customer clickstream data, customer transactions from IoT devices, interactions from social media. Good thing is, storage costs are getting cheaper and we can afford to store data and analyse it.
If we decide to store such data in traditional relational data warehouses, will it be possible? Oh yes, why not?! But we will need a big team of DBAs and developers, to tune and to distribute the data in the relational database. We could add more CPUs, more discs, more memory  on our database servers to scale up and continue getting acceptable performance as data increases day by day. And it all sounds very costly considering the hardware costs and we will still not get the performance because we are trying to force unstructured data into rows and columns.

There is a massive category of non-relational databases to handle this exact issue. Let’s see some features:
As non-relational databases are distributed to provide faster performance, we might not get immediate consistency as in relational dbs.
When we start development, we need not spend months designing the schema of our database. Schema on read is an important feature of non-relational dbs.
Relational dbs can be accessed via ODBC, JDBC connections, whereas non-relational dbs can be accessed using APIs and also with SQLs.
Auto sharding is another feature, where developers need not take care of tuning the database or worry about hardware requirements if it’s a cloud service. E.g. for DynamoDB we only provide RCUs and WCUs (read and write capacity units) . It will automatically calculate the number of partitions needed based on the capacity we want.
Availability and durability is achieved from replication. Amazon S3 offers 99 point 11 times 9 durability by replicating the data across different availability zones.

How do design considerations differ when we design relational or non-relational databases? While designing relational database, Data is the only consideration. We define data entities, foreign key relationships, if the relations are 1:M or M:M, columns etc.
Whereas for non-relational structures, query access pattern  is the most important consideration. We will design based on the most frequently accessed queries or how user is going to use the data. We need to know this information in advance before we can design non-relational database model.

In terms of generic use cases, relational data warehouses still hold good while designing a data warehouse for analytical purposes. For Operational data store, where users are constantly retrieving data from millions of rows and expecting millisecond responses, a non-relational data store could be better. Such ODS will also reduce burden from relational data warehouse which could be used for transaction processing and to perform aggregate complex queries for reporting.

Let’s see types of non-relational databases now.
Key value store: Are the most simple non-relational data stores. And hence they are faster too. One use case is for clickstream analysis when we need to respond quickly to a trending video on YouTube! (and increase the advertisement cost, till it lasts!)
Some key value data stores are stored in memory than on disc to reduce latency. E.g. Amazon Redis, Amazon Aerospike
Document database: mongoDB (stores data in unstructured JSON blob), Amazon dynamoDB
Wide column store: Refer to this article to understand the concept.
Names and format of the columns in a wide column store can vary from row to row in the same table. A wide column store can be interpreted as a two-dimensional key-value store. Examples are Apache Hbase, Cassandra, Google Bigtable
Cloud data warehouse: These are relational data warehouses, which provide more flexibility and ease of implementation compared to on premise warehouses. E.g. Amazon Redshift, Google big query, Azure sql data warehouse, snowflake
Graph database: There are 2 types. Triples (2 nodes and relationship) and Compact, where properties are attached to the node itself. E.g. Neo4j, Titan
Search data store: Provides Google like search functionality. E.g. Elastic search, which searches any type of data, abbreviations. Incorporates machine learning and AI. Splunk, used for searching enterprise logs
Time series database: is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range) e.g. eXtremeDB, Informix Timeseries

HDFS: Part of Hadoop ecosystem. Drawback is, it can only be used for batch processing and not for Transaction handling. E.g. MapR, Cloudera, Hortonworks, Amazon Elastic Map Reduce, Microsoft Azure HDinsight
Block or object store: It is cheap and durable. Used to store images, files. E.g. Amazon S3

When there are so many technologies and products available, we get freedom to choose whats best for our use case and implement polygot persistence. No solution is perfect, we have to make it perfect for us :)

Comments

Popular posts from this blog

Two pizza team!

Data governance and (hence) Metadata Management

Understanding Wide-Column Stores