When to replace RDBMS/ORM with NoSQL

jgauffin · Aug 19, 2010

What kinds of projects benefit from using a NoSQL database instead of an RDBMS wrapped by an ORM?

Examples:

  • Sites similar to Stack Overflow?
  • Social communities?
  • Forums?

Answer

Niels van der Rest · Aug 19, 2010

Your question is very general. NoSQL describes a collection of database techniques that are very different from each other. Roughly, there are:

  • Key-value stores (Redis, Riak)
  • Triplestores (AllegroGraph)
  • Column-family stores (Bigtable, Cassandra)
  • Document-oriented stores (CouchDB, MongoDB)
  • Graph databases (Neo4j)

A project can benefit from the use of a document database during the development phase of the project, because you won't have to design complex entity-relation diagrams or write complex join queries. I've detailed other uses of document databases in this answer.
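To illustrate the point about joins, here is a minimal sketch of how a blog post might look as a single document. The field names and data are made up for the example, and a plain Python dict stands in for whatever a real document database would store.

```python
import json

# Hypothetical blog post stored as one self-contained document.
# In a relational schema this would typically be spread over
# posts, authors, tags and comments tables.
post = {
    "title": "Hello NoSQL",
    "author": {"name": "Ann"},
    "tags": ["nosql", "intro"],
    "comments": [
        {"by": "Bob", "text": "Nice post"},
        {"by": "Cid", "text": "Thanks"},
    ],
}

# Reading the post together with its comments is one lookup, not a join.
commenters = [comment["by"] for comment in post["comments"]]

# Document stores commonly exchange documents as JSON-like structures.
serialized = json.dumps(post)
```

The trade-off is that data shared between documents (an author's name, say) may end up duplicated, which is something a normalized relational schema avoids.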

If your application needs to handle very large amounts of data, the development phase will likely be longer when you use a specialized NoSQL solution such as Cassandra. However, when your application goes into production, it will greatly benefit from the performance and scalability of Cassandra.

Very generally speaking, if an application has the following requirements:

  • scale horizontally
  • work with data model X
  • perform Y operations

the application will benefit from using a NoSQL solution that is geared towards storing data model X and performing Y operations on the data. If you need more specific answers regarding a certain type of NoSQL database, you'll need to update your question.

  1. Benefits during development (e.g. easier to use than SQL, no licensing costs)?
  2. Benefits in terms of performance (e.g. runs like hell with a million concurrent users)?
  3. What type of NoSQL database?

Update

Key-value stores can only be queried by key in most cases. They're useful to store simple data, such as user sessions, simple profile data or precomputed values and output. Although it is possible to store more complex data in key-value pairs, it burdens the application with the responsibility of maintaining 'manual' indexes in order to perform more advanced queries.
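As a rough sketch of what maintaining a 'manual' index means: the code below uses a plain Python dict as a stand-in for a key-value store. The key naming scheme and the helper functions are hypothetical, not the API of any real product.

```python
# Hypothetical in-memory stand-in for a key-value store (a plain dict,
# not a real client library).
store = {}

def put_user(user_id, profile):
    """Store a profile and keep a hand-made index from city to user ids."""
    old = store.get(f"user:{user_id}")
    if old is not None:
        # The application must remove the stale index entry itself.
        store[f"index:city:{old['city']}"].discard(user_id)
    store[f"user:{user_id}"] = profile
    store.setdefault(f"index:city:{profile['city']}", set()).add(user_id)

def users_in_city(city):
    """Answer a query that is not by primary key by reading the index."""
    ids = store.get(f"index:city:{city}", set())
    return sorted(store[f"user:{i}"]["name"] for i in ids)

put_user(1, {"name": "Ann", "city": "Oslo"})
put_user(2, {"name": "Bob", "city": "Oslo"})
put_user(1, {"name": "Ann", "city": "Rome"})  # Ann moves; the index must follow
```

Every write path has to remember to update the index, which is exactly the burden that an RDBMS would otherwise take care of.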

Triplestores are for storing Resource Description Framework (RDF) data. I don't know anything about these stores, except for what Wikipedia tells me, so you'll have to do some research on that.

Column-family stores are built for storing and processing very large amounts of data. They are used by Google's search engine and Facebook's inbox search. The data is queried by MapReduce functions. Although MapReduce functions may be hard to grasp in the beginning, the concept is quite simple. Here's an analogy which (hopefully) explains the concept:

Imagine you have multiple shoe-boxes filled with receipts, and you want to calculate your total expenses. You invite some of your friends over and assign a person to each shoe-box. Each person writes down the total of each receipt in his shoe-box. This process of selecting the required data is the Map part.

When a person has written down the totals of (some of) his receipts, he can sum up these totals. This is the Reduce part and can be repeated multiple times until all receipts have been handled. In the end, all of your friends come together and sum up their total sums, giving you your total expenses. That's the final Reduce step.

The advantage of this approach is that you can have any number of shoe-boxes and you can assign any number of people to a shoe-box and still end up with the same result. Each shoe-box can be seen as a server in the database's network. Each friend can be seen as a thread on the server. With MapReduce you can have your data distributed across many servers and have each server handle part of the query, optimizing the performance of your database.
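The shoe-box analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular database implements MapReduce: the receipts are made-up data, the function names are my own, and everything runs sequentially here, whereas a real cluster would process the boxes in parallel.

```python
from functools import reduce

# Hypothetical data: each inner list is one shoe-box of receipts.
shoe_boxes = [
    [{"shop": "bakery", "total": 12.5}, {"shop": "kiosk", "total": 3.25},
     {"shop": "cafe", "total": 8.0}],
    [{"shop": "garage", "total": 40.0}, {"shop": "kiosk", "total": 1.75}],
    [{"shop": "cafe", "total": 5.25}, {"shop": "market", "total": 10.0},
     {"shop": "kiosk", "total": 2.0}],
]

def map_receipt(receipt):
    # Map: select the required value from a single receipt.
    return receipt["total"]

def reduce_totals(left, right):
    # Reduce: combine two partial sums into one.
    return left + right

def process_box(box):
    # One friend handles one shoe-box: map each receipt, then reduce.
    return reduce(reduce_totals, map(map_receipt, box), 0.0)

# Each box yields a subtotal; the final reduce combines the subtotals.
subtotals = [process_box(box) for box in shoe_boxes]
total = reduce(reduce_totals, subtotals, 0.0)
```

Because the reduce step only adds partial sums, it does not matter how the receipts are divided over boxes: putting all receipts in a single box gives the same `total`.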

Document-oriented stores are explained in this question, so I won't discuss them here.

Graph databases are for storing networks of highly connected objects, such as the users of a social network. These databases are optimized for graph operations, such as finding the shortest path between two nodes, or finding all nodes within three hops of the current node. Such operations are quite expensive on an RDBMS or on other kinds of NoSQL databases, but very cheap on graph databases.
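To make the two graph operations concrete, here is a minimal in-memory sketch using breadth-first search. The friendship data and function names are invented for the example, and a real graph database would expose this through its own query language rather than code like this.

```python
from collections import deque

# Hypothetical friendship graph as an adjacency list.
friends = {
    "ann": ["bob", "cid"],
    "bob": ["ann", "dan"],
    "cid": ["ann", "dan"],
    "dan": ["bob", "cid", "eve"],
    "eve": ["dan", "fay"],
    "fay": ["eve"],
}

def hops_from(graph, start):
    """Breadth-first search: shortest hop count from start to each reachable node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops

def within_hops(graph, start, limit):
    """All nodes at most `limit` hops away from start, excluding start itself."""
    distances = hops_from(graph, start)
    return sorted(n for n, d in distances.items() if 0 < d <= limit)

nearby = within_hops(friends, "ann", 3)
ann_to_fay = hops_from(friends, "ann")["fay"]
```

In a relational schema, each extra hop usually means another self-join on the friendship table, which is why this kind of traversal gets expensive there.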