What's the difference between a catalog and a schema in a relational database?

database schema catalog

Stephan · Aug 11, 2011 · Viewed 86.9k times · Source

I used to think schema were the "upper wrapper" object before the database itself. I mean DB.schema.<what_ever_object_name_under_schema>.

Well, the catalog "wrapper" is now quite confusing. Why should we need a catalog? For what purpose, precisely should the catalog be used?

Answer

Mike Sherrill 'Cat Recall' gave an excellent answer. I'll add simply one example: Postgres.

Cluster = A Postgres Installation

When you install Postgres on a machine, that installation is called a cluster. ‘Cluster’ here is not meant in the hardware sense of multiple computers working together. In Postgres, cluster refers to the fact that you can have multiple unrelated databases all up and running using the same Postgres server engine.

The word cluster is also defined by the SQL Standard in the same way as in Postgres. Closely following the SQL Standard is a primary goal of the Postgres project.

The SQL-92 specification says:

A cluster is an implementation-defined collection of catalogs.

and

Exactly one cluster is associated with an SQL-session

That's an obtuse way of saying a cluster is a database server (each catalog is a database).

Cluster > Catalog > Schema > Table > Columns & Rows

So in both Postgres and the SQL Standard we have this containment hierarchy:

A computer may have one cluster or multiple.
A database server is a cluster.
A cluster has catalogs. ( Catalog = Database )
Catalogs have schemas. (Schema = namespace of tables, and security boundary)
Schemas have tables.
Tables have rows.
Rows have values, defined by columns.
Those values are the business data your apps and users care about such as person's name, invoice due date, product price, gamer’s high score. The column defines the data type of the values (text, date, number, and so on).

Diagram showing nesting boxes representing how connecting on a port gets you to cluster (a database server) which contains one or more Catalogs (a database) each of which contains one or more Schemas (a namespace) each of which contains tables each of which has rows.

Multiple Clusters

This diagram represents a single cluster. In the case of Postgres, you can have more than one cluster per host computer (or virtual OS). Multiple clusters is commonly done, for testing and deploying new versions of Postgres (ex: 9.0, 9.1, 9.2, 9.3, 9.4, 9.5).

If you did have multiple clusters, imagine the diagram above duplicated.

Different port numbers allow the multiple clusters to live side-by-side all up and running at the same time. Each cluster would be assigned its own port number. The usual 5432 is only the default, and can be set by you. Each cluster is listening on its own assigned port for incoming database connections.

Example Scenario

For example, a company could have two different software development teams. One writes software to manage the warehouses while the other team builds software to manage sales and marketing. Each dev team has their own database, blissfully unaware of the other’s.

But the IT operations team took a decision to run both databases on a single computer box (Linux, Mac, whatever). So on that box they installed Postgres. So one database server (database cluster). In that cluster, they create two catalogs, a catalog for each dev team: one named 'warehouse' and one named 'sales'.

Each dev team uses many dozens of tables with different purposes and access roles. So each dev team organizes their tables into schemas. By coincidence, both dev teams do some tracking of accounting data, so each team happens to have a schema named 'accounting'. Using the same schema name is not a problem because the catalogs each have their own namespace so no collision.

Furthermore, each team eventually creates a table for accounting purposes named 'ledger'. Again, no naming collision.

You can think of this example as a hierarchy…

Computer (hardware box or virtualized server)
- Postgres 9.2 cluster (installation)
  - warehouse catalog (database)
    - inventory schema
      - [… some tables]
    - accounting schema
      - ledger table
      - [… some other tables]
  - sales catalog (database)
    - selling schema
      - [… some tables]
    - accounting schema (coincidental same name as above)
      - ledger table (coincidental same name as above)
      - [… some other tables]
- Postgres 9.3 cluster
  - [… other schemas & tables]

Each dev team's software makes a connection to the cluster. When doing so, they must specify which catalog (database) is theirs. Postgres requires that you connect to one catalog, but you are not limited to that catalog. That initial catalog is merely a default, used when your SQL statements omit the name of a catalog.

So if the dev team ever needs to access the other team's tables, they may do so if the database administrator has given them privileges to do so. Access is made with explicit naming in the pattern: catalog.schema.table. So if the 'warehouse' team needs to see the other team’s ('sales' team) ledger, they write SQL statements with sales.accounting.ledger. To access their own ledger, they merely write accounting.ledger. If they access both ledgers in the same piece of source code, they may choose to avoid confusion by including their own (optional) catalog name, warehouse.accounting.ledger versus sales.accounting.ledger.

By the way…

You may hear the word schema used in a more general sense, meaning the entire design of a particular database's table structure. By contrast, in the SQL Standard the word means specifically the particular layer in the Cluster > Catalog > Schema > Table hierarchy.

Postgres uses both the word database as well as catalog in various places such as the CREATE DATABASE command.

Not all database system provides this full hierarchy of Cluster > Catalog > Schema > Table. Some have only a single catalog (database). Some have no schema, just one set of tables. Postgres is an exceptionally powerful product.