What's the difference between a catalog and a schema in a relational database?

I used to think schema were the "upper wrapper" object before the database itself. I mean DB.schema.<what_ever_object_name_under_schema>.

Well, the catalog "wrapper" is now quite confusing. Why should we need a catalog? For what purpose, precisely should the catalog be used?

112074 次浏览

From the relational point of view :

The catalog is the place where--among other things--all of the various schemas (external, conceptual, internal) and all of the corresponding mappings (external/conceptual, conceptual/internal) are kept.

In other words, the catalog contains detailed information (sometimes called descriptor information or metadata) regarding the various objects that are of interest to the system itself.

For example, the optimizer uses catalog information about indexes and other physical storage structures, as well as much other information, to help it decide how to implement user requests. Likewise, the security subsystem uses catalog information about users and security constraints to grant or deny such requests in the first place.

An Introduction to Database Systems, 7th ed., C.J. Date, p 69-70.


From the SQL standard point of view :

Catalogs are named collections of schemas in an SQL-environment. An SQL-environment contains zero or more catalogs. A catalog contains one or more schemas, but always contains a schema named INFORMATION_SCHEMA that contains the views and domains of the Information Schema.

Database Language SQL, (Proposed revised text of DIS 9075), p 45


From the SQL point of view :

A catalog is often synonymous with database. In most SQL dbms, if you query the information_schema views, you'll find that values in the "table_catalog" column map to the name of a database.

If you find your platform using catalog in a broader way than any of these three definitions, it might be referring to something broader than a database--a database cluster, a server, or a server cluster. But I kind of doubt that, since you'd have found that easily in your platform's documentation.

Mike Sherrill 'Cat Recall' gave an excellent answer. I'll add simply one example: Postgres.

Cluster = A Postgres Installation

When you install Postgres on a machine, that installation is called a cluster. ‘Cluster’ here is not meant in the hardware sense of multiple computers working together. In Postgres, cluster refers to the fact that you can have multiple unrelated databases all up and running using the same Postgres server engine.

The word cluster is also defined by the SQL Standard in the same way as in Postgres. Closely following the SQL Standard is a primary goal of the Postgres project.

The SQL-92 specification says:

A cluster is an implementation-defined collection of catalogs.

and

Exactly one cluster is associated with an SQL-session

That's an obtuse way of saying a cluster is a database server (each catalog is a database).

Cluster > Catalog > Schema > Table > Columns & Rows

So in both Postgres and the SQL Standard we have this containment hierarchy:

  • A computer may have one cluster or multiple.
  • A database server is a cluster.
  • A cluster has catalogs. ( Catalog = Database )
  • Catalogs have schemas. (Schema = namespace of tables, and security boundary)
  • Schemas have tables.
  • Tables have rows.
  • Rows have values, defined by columns.
    Those values are the business data your apps and users care about such as person's name, invoice due date, product price, gamer’s high score. The column defines the data type of the values (text, date, number, and so on).

Diagram showing nesting boxes representing how connecting on a port gets you to cluster (a database server) which contains one or more Catalogs (a database) each of which contains one or more Schemas (a namespace) each of which contains tables each of which has rows.

Multiple Clusters

This diagram represents a single cluster. In the case of Postgres, you can have more than one cluster per host computer (or virtual OS). Multiple clusters is commonly done, for testing and deploying new versions of Postgres (ex: 9.0, 9.1, 9.2, 9.3, 9.4, 9.5).

If you did have multiple clusters, imagine the diagram above duplicated.

Different port numbers allow the multiple clusters to live side-by-side all up and running at the same time. Each cluster would be assigned its own port number. The usual 5432 is only the default, and can be set by you. Each cluster is listening on its own assigned port for incoming database connections.

Example Scenario

For example, a company could have two different software development teams. One writes software to manage the warehouses while the other team builds software to manage sales and marketing. Each dev team has their own database, blissfully unaware of the other’s.

But the IT operations team took a decision to run both databases on a single computer box (Linux, Mac, whatever). So on that box they installed Postgres. So one database server (database cluster). In that cluster, they create two catalogs, a catalog for each dev team: one named 'warehouse' and one named 'sales'.

Each dev team uses many dozens of tables with different purposes and access roles. So each dev team organizes their tables into schemas. By coincidence, both dev teams do some tracking of accounting data, so each team happens to have a schema named 'accounting'. Using the same schema name is not a problem because the catalogs each have their own namespace so no collision.

Furthermore, each team eventually creates a table for accounting purposes named 'ledger'. Again, no naming collision.

You can think of this example as a hierarchy…

  • Computer (hardware box or virtualized server)
    • Postgres 9.2 cluster (installation)
      • warehouse catalog (database)
        • inventory schema
          • [… some tables]
        • accounting schema
          • ledger table
          • [… some other tables]
      • sales catalog (database)
        • selling schema
          • [… some tables]
        • accounting schema (coincidental same name as above)
          • ledger table (coincidental same name as above)
          • [… some other tables]
    • Postgres 9.3 cluster
      • [… other schemas & tables]

Each dev team's software makes a connection to the cluster. When doing so, they must specify which catalog (database) is theirs. Postgres requires that you connect to one catalog, but you are not limited to that catalog. That initial catalog is merely a default, used when your SQL statements omit the name of a catalog.

So if the dev team ever needs to access the other team's tables, they may do so if the database administrator has given them privileges to do so. Access is made with explicit naming in the pattern: catalog.schema.table. So if the 'warehouse' team needs to see the other team’s ('sales' team) ledger, they write SQL statements with sales.accounting.ledger. To access their own ledger, they merely write accounting.ledger. If they access both ledgers in the same piece of source code, they may choose to avoid confusion by including their own (optional) catalog name, warehouse.accounting.ledger versus sales.accounting.ledger.


By the way…

You may hear the word schema used in a more general sense, meaning the entire design of a particular database's table structure. By contrast, in the SQL Standard the word means specifically the particular layer in the Cluster > Catalog > Schema > Table hierarchy.

Postgres uses both the word database as well as catalog in various places such as the CREATE DATABASE command.

Not all database system provides this full hierarchy of Cluster > Catalog > Schema > Table. Some have only a single catalog (database). Some have no schema, just one set of tables. Postgres is an exceptionally powerful product.