Representing Sparse Data in PostgreSQL

Cerin · Apr 7, 2010 · Viewed 9.8k times

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:

  1. Store data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that with most RDBMS implementations this is typically very inefficient, since the NULL values usually still take up some space. However, I read an article (can't find its link, unfortunately) that claimed PG doesn't use any space for NULL values, making it better suited for storing sparse data.

  2. Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row (something like the sketch below). I believe this is the more traditional RDBMS solution, but there's more complexity and overhead associated with it.
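Something like the following, with placeholder table and column names:

CREATE TABLE matrix_row    (row_id    serial PRIMARY KEY);
CREATE TABLE matrix_column (column_id serial PRIMARY KEY);

-- one entry per populated cell; missing rows simply mean "no value"
CREATE TABLE matrix_value (
  row_id    integer NOT NULL REFERENCES matrix_row,
  column_id integer NOT NULL REFERENCES matrix_column,
  value     double precision NOT NULL,
  PRIMARY KEY (row_id, column_id)
);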

I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.

Are there any other solutions? Which one should I use?

Answer

Tebas · Apr 28, 2011

I'm assuming you're thinking of sparse matrices in the mathematical sense: http://en.wikipedia.org/wiki/Sparse_matrix (The storage techniques described there are aimed at in-memory storage for fast arithmetic operations, not at persistent storage with low disk usage.)

Since you usually operate on these matrices on the client side rather than on the server side, a SQL ARRAY[] is the best choice!

The question is how to take advantage of the sparsity of the matrix. Here are the results of some investigations.

Setup:

  • Postgres 8.4
  • Matrices with 400×400 elements in double precision (8 bytes) --> 1.28 MB raw size per matrix
  • 33% non-zero elements --> 427 kB effective size per matrix
  • averaged over ~1000 different randomly populated matrices

Competing methods:

  • Rely on the automatic server-side compression of columns with SET STORAGE MAIN or EXTENDED.
  • Store only the non-zero elements plus a bitmap (bit varying(xx)) describing where the non-zero elements sit in the matrix (a sketch follows below). (One double precision value is 64 times bigger than one bit, so in theory, ignoring overheads, this method should be an improvement whenever <= 98% of the elements are non-zero ;-).) Server-side compression is activated.
  • Replace the zeros in the matrix with NULLs. (RDBMSs are very efficient at storing NULLs.) Server-side compression is activated.

(Indexing the non-zero elements using a second index ARRAY[] is not very promising and was therefore not tested.)
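To make the bitmap variant concrete, here is a rough sketch that derives the 0/1 pattern and the packed non-zero array from a dense array on the server. It needs PostgreSQL 9.4+ for WITH ORDINALITY and FILTER (the measurements above were done on 8.4, where this bookkeeping happened on the client side), and the 0/1 string would be sent as a bit-string literal in the actual INSERT:

SELECT string_agg(CASE WHEN x <> 0 THEN '1' ELSE '0' END, '' ORDER BY ord) AS bitmap_pattern,
       array_agg(x ORDER BY ord) FILTER (WHERE x <> 0)                     AS nonzeros
FROM unnest(ARRAY[0, 3.5, 0, 0, 2.0, 0]::double precision[]) WITH ORDINALITY AS t(x, ord);
-- gives bitmap_pattern '010010' and nonzeros {3.5,2}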

Results:

  • Automatic compression
    • no extra implementation efforts
    • network traffic not reduced
    • minimal compression overhead
    • persistent storage = 39% of the raw size
  • Bitmap
    • acceptable implementation effort
    • network traffic slightly decreased; dependent on sparsity
    • persistent storage = 33.9% of the raw size
  • Replace zeros with NULLs
    • some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
    • no change in network traffic
    • persistent storage = 35% of the raw size

Conclusion: Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and rerun my test setup with your own sparsity level. But the effect may be smaller than you expect.
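For reference, the storage parameter is set per column: EXTENDED (the default for array columns) allows both compression and out-of-line TOAST storage, while MAIN prefers compressed inline storage. For example:

ALTER TABLE _testschema.matrix_sparse_autocompressed ALTER COLUMN matdata SET STORAGE MAIN;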

I suggest always using matrix serialization (e.g. row-major order) plus two integer columns for the matrix dimensions N×M, as sketched below. Since most APIs use textual SQL, you save a lot of network traffic and client memory compared to nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]"!
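A possible shape for that, with hypothetical table and column names; element (i, j) of an N×M matrix stored row-major sits at 1-based array index (i-1)*M + j:

CREATE TABLE _testschema.matrix_serialized
(
  nrows   integer NOT NULL,    -- N
  ncols   integer NOT NULL,    -- M
  matdata double precision[]   -- flattened row-major, length nrows * ncols
);

-- e.g. element (i, j) = (2, 3):
SELECT matdata[(2 - 1) * ncols + 3] FROM _testschema.matrix_serialized;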

Tebas


-- dense baseline: full array, stored out of line without compression
CREATE TABLE _testschema.matrix_dense
(
  matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;


-- same dense array, but with the default EXTENDED storage, i.e. automatic compression
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
  matdata double precision[]
);

-- non-zero elements only, plus a bitmap marking their positions in the matrix
CREATE TABLE _testschema.matrix_sparse_bitmap
(
  matdata double precision[],
  bitmap bit varying(8000000)
);

Insert the same matrices into all tables (the concrete representation depends on the table). Do not change the data on the server side afterwards, because updated rows leave unused but allocated pages behind and skew the size measurements; if you do, run a VACUUM first.
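By way of illustration only (a tiny 2×2 matrix instead of the 400×400 ones used in the test), the representations could be populated like this:

-- dense / autocompressed: full array, zeros included
INSERT INTO _testschema.matrix_dense                 VALUES (ARRAY[1.5, 0, 0, 2.5]);
INSERT INTO _testschema.matrix_sparse_autocompressed VALUES (ARRAY[1.5, 0, 0, 2.5]);

-- bitmap variant: only the non-zero values, plus one bit per cell marking where they go
INSERT INTO _testschema.matrix_sparse_bitmap         VALUES (ARRAY[1.5, 2.5], B'1001');

-- NULL variant: same schema as the autocompressed table, with zeros replaced by NULLs,
-- e.g. VALUES (ARRAY[1.5, NULL, NULL, 2.5])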

SELECT 
pg_total_relation_size('_testschema.matrix_dense') AS dense, 
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed, 
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;