DISTRIBUTE BY notices in Greenplum

Amelio Vazquez-Reina picture Amelio Vazquez-Reina · Feb 7, 2014 · Viewed 15.7k times · Source

Say I run the following query on psql:

> select a.c1, b.c2 into temp_table from db.A as a inner join db.B as b 
> on a.x = b.x limit 10;

I get the following message:

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'c1' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

  1. What is a DISTRIBUTED BY column?
  2. Where is temp_table stored? Is it stored on my client or on the server?

Answer

Wes Reing picture Wes Reing · Feb 7, 2014
  1. DISTRIBUTED BY is how Greenplum determines which segment will store each row. Because Greenplum is an MPP database in most production databases you will have multiple segment servers. You want to make sure that the Distribution column is the column you will join on usaly.

  2. temp_table is a table that will be created for you on the Greenplum cluster. If you haven't set search_path to something else it will be in the public schema.