Certified Data Mining and Warehousing Professional Snowflake schema aggregate fact tables and families of stars

Snowflake schema aggregate fact tables and families of stars
 


A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.

The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are normalized with each dimension represented by a single table. A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables ("forks in the road"). The "snowflaking" effect only affects the dimension tables and NOT the fact tables.

Common uses

Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

Deciding whether to employ a star schema or a snowflake schema should involve considering the relative strengths of the database platform in question and the query tool to be employed. Star schemas should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature. Snowflake schemas are often better with more sophisticated query tools that create a layer of abstraction between the users and raw table structures for environments having numerous queries with complex criteria.

Data normalization and storage

Normalization splits up data to avoid redundancy (duplication) by moving commonly repeating groups of data into new tables. Normalization therefore tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes.

From a space storage point of view, the dimensional tables are typically small compared to the fact tables. This often removes the storage space benefit of snowflaking the dimension tables, as compared with a star schema.

Some database developers compromise by creating an underlying snowflake schema with views built on top of it that perform many of the necessary joins to simulate a star schema. This provides the storage benefits achieved through the normalization of dimensions with the ease of querying that the star schema provides. The tradeoff is that requiring the server to perform the underlying joins automatically can result in a performance hit when querying as well as extra joins to tables that may not be necessary to fulfill certain queries.

Benefits of "snowballings"

  • Some OLAP multidimensional database modeling tools that use dimensional data marts as data sources are optimized for snowflake schemas.
  • If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate.
  • A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will therefore be easier to implement.
  • A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.
  • Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools to form such a query, especially if provision for these forms of query weren't anticipated when the data warehouse was first designed.

Examples

Snowflake schema used by example query.

The example schema shown to the right is a snowflaked version of the star schema example provided in the star schema article.

The following example query is the snowflake schema equivalent of the star schema example code which returns the total number of units sold by brand and by country for 1997. Notice that the snowflake schema query requires many more joins than the star schema version in order to fulfill even a simple query. The benefit of using the snowflake schema in this example is that the storage requirements are lower since the snowflake schema eliminates many duplicate values from the dimensions themselves.

SELECT
        B.Brand,
        G.Country,
        SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D             ON F.Date_Id = D.Id
INNER JOIN Dim_Store S            ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G        ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P          ON F.Product_Id = P.Id
INNER JOIN Dim_Brand B            ON P.Brand_Id = B.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
WHERE
        D.YEAR = 1997 AND
        C.Product_Category = 'tv'
GROUP BY
        B.Brand,
        G.Country


Snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.

Snowflaking is used to improve the performance of certain queries.  The schema is diagramed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

Data warehouses and data marts may use snowflaking to support specific query needs. Snowflaking can improve query performance against low cardinality attributes that are queried independently.  Business intelligence applications that use a relational OLAP (ROLAP) architecture may perform better when the data warehouse schema is snowflaked.

A star schema stores all attributes for a dimension into one denormalized (“flattened”) table. This requires more disk space than a more normalized snowflake schema.  Snowflaking normalizes the dimension by moving attributes with low cardinality (few distinct values) into separate dimension tables that relate to the core dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended, however, because it can adversely impact query performance.

Cases for snowflaking include:

  • Sparsely populated attributes, where most dimension member records have a NULL value for the attribute, are moved to a sub-dimension.
  •  
  • Low cardinality attributes that are queried independently.  For example, a product dimension may contain thousands of products, but only a handful of product types.  Moving the product type attribute to its own dimension table can improve performance when the product types are queried independently.
  •  
  • Attributes that are part of a hierarchy and are queried independently.  Examples include the year, quarter, and month attributes of a date hierarchy; and the country and state attributes of a geographic hierarchy.

Comparison chart

  Snowflake Schema Star Schema
Ease of maintenance/change: No redundancy and hence more easy to maintain and change Has redundant data and hence less easy to maintain/change
Ease of Use: More complex queries and hence less easy to understand Less complex queries and easy to understand
Query Performance: More foreign keys-and hence more query execution time Less no. of foreign keys and hence lesser query execution time
Normalization: Has normalized tables Has De-normalized tables
Type of Datawarehouse: Good to use for small datawarehouses/datamarts Good for large datawarehouses
Joins: Higher number of Joins Fewer Joins
Dimension table: It may have more than one dimension table for each dimension Contains only single dimension table for each dimension
When to use: When dimension table is relatively big in size, snowflaking is better as it reduces space. When dimension table contains less number of rows, we can go for Star schema.

Aggregate fact tables

Aggregate fact tables are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data. These new metrics, called “aggregate facts” or “summary statistics” are stored and maintained in the data warehouse database in special fact tables at the grain of the aggregation. Likewise, the corresponding dimensions are rolled up and condensed to match the new grain of the fact.

These specialized tables are used as substitutes whenever possible for returning user queries. The reason? Speed. Querying a tidy aggregate table is much faster and uses much less disk I/O than the base, atomic fact table, especially if the dimensions are large as well. If you want to wow your users, start adding aggregates. You can even use this “trick” in your operational systems to serve as a foundation for operational reports. I’ve always done this for any report referred to by my users as a “Summary”. (As an aside, there is a difference between an “aggregate” and a “summary”. I’ll explore these differences in my next post.)

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.

Here, using the atomic transaction FactOrders table:

SELECT c.state, p.product_type, t.year, t.month, SUM(f.order_amount)
    FROM FactOrders f
    JOIN DimCustomer c ON c.CustomerID = f.CustomerID
    JOIN DimProduct p ON p.ProductID = f.ProductID
    JOIN DimTime t ON 
        t.year = DATEPART(yy, f.YearMonthID) AND 
        t.month = DATEPART(mm, f.DateID)
    GROUP BY c.state, p.product_type, t.year, t.month

The aggregate is querying much less data and queries against time are now much simpler. In my non-scientific tests, the following query ran many times faster (a few seconds compared to about 30 seconds!).

SELECT c.state, p.product_type, t.year, t.month, SUM(f.order_amount)
    FROM FactOrders_Agg1 f
    JOIN DimCustomerState c ON c.CustomerStateID = f.CustomerStateID
    JOIN DimProductType p ON p.ProductTypeID = f.ProductTypeID
    JOIN DimMonth t ON t.YearMonthID = f.YearMonthID
    GROUP BY c.state, p.product_type, t.year, t.month

Creating the Fact and Dimensions

To implement, you will need to roll up your fact table by the hierarchies found in your dimensions. The result will be a new fact table, a set of new accompanying dimensions at the grain of the fact, and all new foreign keys for mapping. I usually name the fact table the same as the base fact with some meaningful suffix appended to the end. In SSMS, this keeps the aggregates with the fact in my object explorer. Dimensions usually get new names (like CustomerState and ProductType) and should be conformed so that they can be reused across business processes. You could even create views instead of new dimensions, but this does not eliminate the need to regenerate new surrogate keys.

When rolling up dimensions, you are provided with an excellent opportunity to perform aggregate functions on the dimension itself and store the results as new attributes. For example, you may want to know how many customers are living in each state. This could be used as the denominator in some population calculation you plan to use against the aggregate fact. Your new dimension might therefore look like the following:

SELECT Cust_Country, Cust_Region, Cust_State, COUNT(Cust_ID) 
FROM DimCustomer 
GROUP BY Cust_Country, Cust_Region, Cust_State

The most obvious aggregate function to use is COUNT, but depending on the type of data you have in your dimensions, other functions may prove useful. Just be warned: If you find that aggregate functions are being used a lot on your dimensions, you may need to revisit your design. There may be opportunities to pull out those metrics into existing or new fact tables!

Generating aggregates is largely an incremental process, where you examine query and reporting usage looking for places to improve performance. Aggregates stored in the RDBMS are maintained through ETL and/or your OLAP engine.

A Note About OLAP Aggregates

Theoretically, storing aggregates in a fact table in a RDBMS is the same as storing them in an OLAP cube. In OLAP storage, aggregates are precalculated summaries of data from the different dimensions of the cube, such that a query that seeks to know the aggregate (sum) of some metric (order amount) for X (customer state) and Y (product type) over T (monthly orders) would only need to look inside the cube at those exact coordinates to get the answer. I won’t pretend to know how this stuff is physically stored, but OLAP engines across the board offer better performance, management, and navigation mechanisms for the aggregations than is available through the RDBMS (even when using Indexed or Materialized Views).

 

Aggregates are used in dimensional models of the data warehouse to produce dramatic positive effects on the time it takes to query large sets of data. At the simplest form an aggregate is a simple summary table that can be derived by performing a Group by SQL query. A more common use of aggregates is to take a dimension and change the granularity of this dimension. When changing the granularity of the dimension the fact table has to be partially summarized to fit the new grain of the new dimension, thus creating new dimensional and fact tables, fitting this new level of grain. Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data, that are stored in new aggregated tables. When facts are aggregated, it is either done by eliminating dimensionality or by associating the facts with a rolled up dimension. Rolled up dimensions should be shrunken versions of the dimensions associated with the granular base facts. This way, the aggregated dimension tables should conform to the base dimension tables. So the reason why aggregates can make such a dramatic increase in the performance of the data warehouse is the reduction of the number of rows to be accessed when responding to a query.

A fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema.

Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).


  1. Almost all data warehouses contain multiple STAR schema structures. Each STAR serves a specific purpose to track the measures stored in the fact table.
  2. A set of related STAR schemas make up a family of STARS. Families of STARS are formed for various reasons. A family can be just formed adding aggregate fact tables and the derived dimension tables to support the aggregates. Sometimes, a core fact table can be created containing facts interesting to most users and customized fact tables for specific user groups.
  3. The fact tables of the STARS in a family share dimension tables. Usually, the time dimension is shared by most of the fact tables in the group. Also, dimension tables from multiple STARS may share the fact table of one STAR.
  4. Examples are snapshot and transaction tables, core and custom tables, and tables supporting a value chain or a value circle. A family of STARS relies on conformed dimension tables and standardized fact tables.

 

A dimensional model of a data warehouse for a large data warehouse consists of between 10 and 25 similar-looking star-join schemas. Each star join will have 5 to 15 dimensional tables.

  • Conformed (shared) dimensions facilitate drillacross.
  • A conformed dimension is a dimension that means the same thing with every possible fact table to which it is joined.
  • Conformed dimensions are either identical or strict mathematical subsets of the most granular detailed dimension

Value chains as families of star-join schemas

  • There are two sides to the value chain
  •  –the demand side - the steps needed to satisfy the customers’ demand for the product
  •  –the supply side - the steps needed to manufacture the products from original ingredients or parts
  • The chain consists of a sequence of inventory and flow star-join schemata
  • joining the different star-join schemata is only possible when two sequential schemata have a common, identical dimension
  • Sometimes the represented chain can be extended beyond the bounds of the business itself
 For Support