What are the options for storing hierarchical data in a relational database?

Question 1

What are the options for storing hierarchical data in a relational database?

sql database tree relational-database hierarchical-data

orangepips · Oct 29, 2010 · Viewed 257k times · Source

Answer

Answer

My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parenthesis).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

Question 2

Good Overviews

Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:

One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set and Nested Interval I've found.
Models for hierarchical data: slides with good explanations of tradeoffs and example usage
Representing hierarchies in MySQL: very good overview of Nested Set in particular
Hierarchical data in RDBMSs: most comprehensive and well-organized set of links I've seen, but not much in the way of explanation

Options

Ones I am aware of and general features:

Adjacency List:
- Columns: ID, ParentID
- Easy to implement.
- Cheap node moves, inserts, and deletes.
- Expensive to find the level, ancestry & descendants, path
- Avoid N+1 via Common Table Expressions in databases that support them
Nested Set (a.k.a Modified Preorder Tree Traversal)
- Columns: Left, Right
- Cheap ancestry, descendants
- Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
Bridge Table (a.k.a. Closure Table /w triggers)
- Uses separate join table with: ancestor, descendant, depth (optional)
- Cheap ancestry and descendants
- Writes costs O(log n) (size of subtree) for insert, updates, deletes
- Normalized encoding: good for RDBMS statistics & query planner in joins
- Requires multiple rows per node
Lineage Column (a.k.a. Materialized Path, Path Enumeration)
- Column: lineage (e.g. /parent/child/grandchild/etc...)
- Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
- Writes costs O(log n) (size of subtree) for insert, updates, deletes
- Non-relational: relies on Array datatype or serialized string format
Nested Intervals
- Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
- Has real/float/decimal representation/precision issues
- Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with added trickiness of linear algebra.
Flat Table
- A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
- Cheap to iterate/paginate over
- Expensive move and delete
- Good Use: threaded discussion - forums / blog comments
Multiple lineage columns
- Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
- Cheap ancestors, descendants, level
- Cheap insert, delete, move of the leaves
- Expensive insert, delete, move of the internal nodes
- Hard limit to how deep the hierarchy can be

Database Specific Notes

MySQL

Use session variables for Adjacency List

Oracle

Use CONNECT BY to traverse Adjacency Lists

PostgreSQL

ltree datatype for Materialized Path

SQL Server

General summary
2008 offers HierarchyId data type appears to help with Lineage Column approach and expand the depth that can be represented.

What are the options for storing hierarchical data in a relational database?

Answer

Related questions