Using COLLECT STATISTICS in Teradata

ChrisCamp picture ChrisCamp · May 21, 2013 · Viewed 32.4k times · Source

In Teradata I can use a statement like ...

collect statistics on my_table column(col1)

This will gather stats on the table and store them in DBC views like ColumnStats, IndexStats and MultiColumnStats. I'm also under the impression that the optimizer (parsing engine) will find the statistics when they are available and use them instead of estimated table cardinality/index value counts to make better decisions on how to execute a query.

This all sounds great, but I have some questions.

  • are there any disadvantages to using collect stats?
  • When is it appropriate/inappropriate to use collect statistics in your SQL scripting?
  • What's the performance benefit to collect statistics on a field that's already indexed?
  • How long are statistics stored for (table, volatile tables)?
  • Any other comments concerning collect statistics would be appreciated.

Answer

gyan picture gyan · May 28, 2013

1>are there any disadvantages to using collect stats?

Yes, collect stats itself is time consuming, it actually locate data from AMPS and insert the stats in dictionary tables.

Suppose you have a table definition like:

ct t1(x1 int,y1 int, z1 int);

The table contains millions of rows and z1 is never used in the ST/Join conditions, then it is not worth to collect stats on z1.

2>When is it appropriate/inappropriate to use collect statistics in your SQL scripting?

Already answered above. If a column is going to be used as ST/Join condition .i.e in where or on clause, you must collect stats, otherwise not needed.

3>What's the performance benefit to collect statistics on a field that's already indexed?

ct t1(x1 int,y1 int) primary index(x1);

for a simple query like sel * from t1 where x1 = 5;

will demonstrate the usefulness of collect stats.

How?

the optimizer can correctly estimate how many rows this query will select and if t1 is going to be joined with say t2, a efficient join will be chosen by optimizer.

4>How long are statistics stored for (table, volatile tables)?

Table : permanently.

volatile tables: till session expires.

5>Any other comments concerning collect statistics would be appreciated.

Nothing has been discussed about multicolumn stats.

Say, the query is like:

sel * from t1 join t2 on y1=y2 and x1=2;

then collecting multi-column stats on (x1,y1) would be quite helpful in optimization.

Also, if table demography has been changed (increased number of rows) you must consider re-collecting the stats