Certified Data Mining and Warehousing Professional Star Schema star schema keys and advantages

Star Schema star schema keys and advantages
 


The star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.

Model

A star schema classifies the attributes of an event into facts (measured numeric/time data), and descriptive dimension attributes (product ID, customer name, sale date) that give the facts a context. A fact record is the nexus between the specific dimension values and the recorded facts. The facts are stored at a uniform level of detail (the grain) in the fact table. Dimension attributes are organized into affinity groups and stored in a minimal number of dimension tables.

A weather star schema that records weather data may have facts of temperature, barometric pressure, wind speed, precipitation, cloud cover, etc. and dimensions of location, date/time, reporter, etc.

Star schemas are designed to optimize user ease-of-use and retrieval performance by minimizing the number of tables to join to materialize a transaction.

A star schema is called such as it comprises a constellation of stars, generally several bright stars (facts) surrounded by dimmer ones (dimensions).

  • The fact table holds the metric values recorded for a specific event. Because of the desire to hold atomic level data, there generally are a very large number of records (billions). Special care is taken to minimize the number and size of attributes in order to constrain the overall table size and maintain performance. Fact tables generally come in 3 flavors - transaction (facts about a specific event e.g. Sale), snapshot (facts recorded at a point in time e.g. Account details at month end), and accumulating snapshot tables (e.g. month-to-date sales for a product).
  • Dimension tables usually have few records compared to fact tables, but may have a very large number of attributes that describe the fact data.

Often there can be dozens to hundreds of dimension attributes describing the various facets of a fact. Dimension attributes are organized into tables of loosely related attributes that share a known or unknown affinity. Attributes of color, style, size, texture can describe a product and would be included in a product dimension table. Dimension tables include attributes that typically would be normalized into separate tables (Snowflake schema). For example, in the US a location can be identified by a zipcode that exists within a neighborhood, city, state, region. All of these attributes would be included in a location dimension table.

On an Entity-Relationship (ER) diagram, fact tables have few distinct columns, while dimension tables have a large number of columns. However, most of the storage is used by the fact table.

Dimension tables are assigned a surrogate primary key of a simple integer that is assigned to the combination of low level attributes that form the natural key. Fact tables should also have a single surrogate primary key to allow for situations where there may be two or more facts having exactly the same set of dimension keys.

A star schema that has many dimensions is sometimes called a centipede schema. Having dimensions of only a few attributes, while simpler to maintain, results in queries with many table joins and makes the star schema less easy to use.

Benefits

The primary benefit of a star schema is its simplicity for users to write, and databases to process: queries are written with simple inner joins between the facts and a small number of dimensions. Star joins are simpler than possible in snowflake schema. WHERE conditions need only to filter on the attributes desired, and aggregations are fast.

The star schema is a way to implement multidimensional database (MDDB) functionality using a mainstream relational database.

Example

Star schema used by example query.

Consider a database of sales, perhaps from a store chain, classified by date, store and product. The image of the schema to the right is a star schema version of the sample schema provided in the snowflake schema article.

Fact_Salesis the fact table and there are three dimension tables Dim_Date, Dim_Storeand Dim_Product.

Each dimension table has a primary key on its Idcolumn, relating to one of the columns (viewed as rows in the example schema) of the Fact_Salestable's three-column (compound) primary key (Date_Id, Store_Id, Product_Id). The non-primary key Units_Soldcolumn of the fact table in this example represents a measure or metric that can be used in calculations and analysis. The non-primary key columns of the dimension tables represent additional attributes of the dimensions (such as the Yearof the Dim_Datedimension).

For example, the following query answers how many TV sets have been sold, for each brand and country, in 1997:

SELECT
        P.Brand,
        S.Country,
        SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D    ON F.Date_Id = D.Id
INNER JOIN Dim_Store S   ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE
        D.YEAR = 1997
AND P.Product_Category = 'tv'
GROUP BY
        P.Brand,
        S.Country

Star schemas

A star schema consists of fact tables and dimension tables. Fact tables contain the quantitative or factual data about a business--the information being queried. This information is often numerical, additive measurements and can consist of many columns and millions or billions of rows. Dimension tables are usually smaller and hold descriptive data that reflects the dimensions, or attributes, of a business. SQL queries then use joins between fact and dimension tables and constraints on the data to return selected information.

Fact and dimension tables differ from each other only in their use within a schema. Their physical structure and the SQL syntax used to create the tables are the same. In a complex schema, a given table can act as a fact table under some conditions and as a dimension table under others. The way in which a table is referred to in a query determines whether a table behaves as a fact table or a dimension table.

Even though they are physically the same type of table, it is important to understand the difference between fact and dimension tables from a logical point of view. To demonstrate the difference between fact and dimension tables, consider how an analyst looks at business performance:

  • A salesperson analyzes revenue by customer, product, market, and time period.
  • A financial analyst tracks actuals and budgets by line item, product, and time period.
  • A marketing person reviews shipments by product, market, and time period.

The facts--what is being analyzed in each case--are revenue, actuals and budgets, and shipments. These items belong in fact tables. The business dimensions--the by items--are product, market, time period, and line item. These items belong in dimension tables.

For example, a fact table in a sales database, implemented with a star schema, might contain the sales revenue for the products of the company from each customer in each geographic market over a period of time. The dimension tables in this database define the customers, products, markets, and time periods used in the fact table.

A well-designed schema provides dimension tables that allow a user to browse a database to become familiar with the information in it and then to write queries with constraints so that only the information that satisfies those constraints is returned from the database.

Performance of star schemas

Performance is an important consideration of any schema, particularly with a decision-support system in which you routinely query large amounts of data. IBM Red Brick Warehouse supports all schema designs. However, star schemas tend to perform the best in decision-support applications.

For more information on the performance implications of star schemas, see the Query Performance Guide.

Terminology

The terms fact table and dimension table represent the roles these objects play in the logical schema. In terms of the physical database, a fact table is a referencing table. That is, it has foreign key references to other tables. A dimension table is a referenced table. That is, it has a primary key that is a foreign key reference from one or more tables.

Simple star schemas

Any table that references or is referenced by another table must have a primary key, which is a column or group of columns whose contents uniquely identify each row. In a simple star schema, the primary key for the fact table consists of one or more foreign keys. A foreign key is a column or group of columns in one table whose values are defined by the primary key in another table. In IBM Red Brick Warehouse, you can use these foreign keys and the primary keys in the tables that they reference to build STAR indexes, which improve data retrieval performance.

When a database is created, the SQL statements used to create the tables must designate the columns that are to form the primary and foreign keys.

The following figure illustrates the relationship of the fact and dimension tables within a simple star schema with a single fact table and three dimension tables. The fact table has a primary key composed of three foreign keys, Key1, Key2, and Key3, each of which is the primary key in a dimension table. Nonkey columns in a fact table are referred to as data columns. In a dimension table, they are referred to as attributes.

Figure 11. Simple Star Schema
Three Dimension tables are connected to a Fact table.

In the figures used to illustrate schemas:

  • The items listed within the box under each table name indicate columns in the table.
  • Primary key columns are labeled in bold type.
  • Foreign key columns are labeled in italic type.
  • Columns that are part of the primary key and are also foreign keys are labeled in bold italic type.
  • Foreign key relationships are indicated by lines connecting tables.

    Although the primary key value must be unique in each row of a dimension table, that value can occur multiple times in the foreign key in the fact table--a many-to-one relationship. 

The following figure illustrates a sales database designed as a simple star schema. In the fact table Sales, the primary key is composed of three foreign keys, Product_id, Period_id, and Market_id, each of which references a primary key in a dimension table.

Figure 12. Sales Database
The Period, Product, and Market tables are connected to the Sales table.

Many-to-one relationships exist between the foreign keys in the fact table and the primary keys they reference in the dimension tables. For example, the Product table defines the products. Each row in the table represents a distinct product and has a unique product identifier. That product identifier can occur multiple times in the Sales table representing sales of that product during each period and in each market.

Multiple fact tables

A star schema can contain multiple fact tables. In some cases, multiple fact tables exist because they contain unrelated facts; for example, invoices and sales. In other cases, they exist because they improve performance. For example, multiple fact tables are often used to hold various levels of aggregated (summary) data, particularly when the amount of aggregation is large; for example, daily sales, monthly sales, and yearly sales.

The following figure illustrates the Sales database with an additional fact table for sales from the previous year.

Figure 13. Sales database with additional dimension
The Period table is connected to Sales_Previous and Sales_Current. The Product table is connected to Sales_Current and Sales_Previous. The Market table is connected only to Sales_Current.

Another use of a referencing table is to define a many-to-many relationship between some dimensions of the business. This type of table is often known as a cross-reference or associative table. For example, in the Sales database, each product belongs to one or more groups, and each group contains multiple products, a many-to-many relationship that is modeled by establishing a referencing table that defines the possible combinations of products and groups.

Figure 14. Sales database with cross-reference table
The Group table is connected to the Product/Group table which is connected to the Product table which is connected to Sales_Current. The Period table connects to Sales_Current. The Market table also connects to Sales_Current.

Multicolumn foreign key

Another way to define a many-to-many relationship is to have a dimension table with a multicolumn primary key that is a foreign key reference from a fact table. For example, in the Sales database, each product belongs to one or more groups, and each group contains multiple products, a many-to-many relationship. This relationship is modeled by defining a multicolumn foreign key in the Sales_Current table that references the Product table, as in the following example.

Figure 15. Sales database with multicolumn foreign key
The Period, Product and Market tables connect to Sales_Current.

In the preceding figure, the Product_id and Group_id columns are the two-column primary key of the Product table and are a two-column foreign key reference from the Sales_Current table.

Outboard tables

Dimension tables can also contain one or more foreign keys that reference the primary key in another dimension table. The referenced dimension tables are sometimes referred to as outboard, outrigger, or secondary dimension tables. The following figure includes two outboard tables, District and Region, which define the ID codes used in the Market table.

Figure 16. Sales database with outboard tables
The District and Region tables connect to the Market table, which connects to Sales_Current. The Period and Product tables also connect to Sales_Current.

In the preceding figure, the Market table, because it is both a referencing and referenced table, can behave as a fact (referencing) or dimension (referenced) table, depending on how it is used in a query.

Multistar schemas

In a simple star schema, the primary key in the fact table is formed by concatenating the foreign key columns. In some applications, however, the concatenated foreign keys might not provide a unique identifier for each row in the fact table. These applications require a multistar schema.

In a multistar schema, the fact table has both a set of foreign keys, which reference dimension tables, and a primary key, which consists of one or more columns that provide a unique identifier for each row. The primary key and the foreign keys are not identical in a multistar schema. This fact distinguishes a multistar schema from a single-star schema.

The following figure illustrates the relationship of the fact and dimension tables within a multistar schema. In the fact table, the foreign keys are Fkey1, Fkey2, and Fkey3, each of which is the primary key in a dimension table. Unlike the simple star schema, these columns do not form the primary key in the fact table. Instead, the two columns Key1 and Key2, which do not reference any dimension tables, and Fkey1, which does reference a dimension table, are concatenated to form the primary key. The primary key can consist of any combination of foreign key and other columns in a multistar schema.

Figure 17. Relationship of fact and dimension tables in multistar schema
Dimension tables Pkey1, 2 and 3 connect to the Fact table.

The following figure illustrates a retail sales database designed as a multistar schema with two outboard tables. The fact table Transact records daily sales in a rolling seven-day database. The primary key for the fact table consists of three columns: Date, Receipt, and Line_item. These keys together provide the unique identifier for each row. The foreign keys are the columns for Store_id and SKU_id, which reference the Store and SKU (storekeeping unit) dimension tables. Two outboard tables, Class and Subclass, are referenced by the SKU dimension table.

Figure 18. Multistar schema with two outboard tables
Outboard tables Class and Subclass connect to the SKU table which connects to the Transact table. The Store table also connects to the Transact table.

In this database schema, analysts can query the transaction table to obtain information on sales of each item, sales by store or region, sales by date, or other interesting information.

In a multistar schema, unlike a simple star schema, the same value for the concatenated foreign key in the fact table can occur in multiple rows, so the concatenated foreign key no longer uniquely identifies each row. For example, in this case the same store (Store_id) might have multiple sales of the same item (SKU_id) on the same day (Date). Instead, row identification is based on the primary key or keys. Each row is uniquely identified by Date, Receipt, and Line_item.

Views

In some databases, schema design can be simplified by the use of views, which effectively create a virtual table by selecting a combination of rows and columns from an existing table or combination of tables. For example, a view that selects employee names and telephone extensions from an employee database produces a company phone list but does not include confidential information such as addresses and salaries. A view that selects transactions that occur within a given time period avoids the need to constrain queries to that time period.

Views are useful for a wide variety of purposes, including the following:

  • Increasing security
  • Simplifying complex tables to give users a view of only what they need
  • Simplifying query constraints
  • Simplifying administrative tasks, such as granting table authorizations
  • Hiding administrative changes to users

    The database schema changes design, but the view to the user remains the same.

A view is created with a CREATE VIEW statement.

 

 For Support