Certified Data Mining and Warehousing Professional Data design and dimensional modeling

Data design and dimensional modeling
 


Dimensional modeling process:

The dimensional model is built on a star-like schema, with dimensions surrounding the fact table. To build the schema, the following design model is used:

  1. Choose the business process
  2. Declare the grain
  3. Identify the dimensions
  4. Identify the fact
Choose the business process

The process of dimensional modeling builds on a 4-step design method that helps to ensure the usability of the dimensional model and the use of the data warehouse. The basics in the design build on the actual business process which the data warehouse should cover. Therefore the first step in the model is to describe the business process which the model builds on. This could for instance be a sales situation in a retail store. To describe the business process, one can choose to do this in plain text or use basic Business Process Modeling Notation (BPMN) or other design guides like the Unified Modeling Language (UML).

Declare the grain

After describing the Business Process, the next step in the design is to declare the grain of the model. The grain of the model is the exact description of what the dimensional model should be focusing on. This could for instance be “An individual line item on a customer slip from a retail store”. To clarify what the grain means, you should pick the central process and describe it with one sentence. Furthermore the grain (sentence) is what you are going to build your dimensions and fact table from. You might find it necessary to go back to this step to alter the grain due to new information gained on what your model is supposed to be able to deliver.

Identify the dimensions

The third step in the design process is to define the dimensions of the model. The dimensions must be defined within the grain from the second step of the 4-step process. Dimensions are the foundation of the fact table, and is where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory etc. These dimensions are where all the data is stored. For example, the date dimension could contain data such as year, month and weekday.

Identify the facts

After defining the dimensions, the next step in the process is to make keys for the fact table. This step is to identify the numeric facts that will populate each fact table row. This step is closely related to the business users of the system, since this is where they get access to data stored in the data warehouse. Therefore most of the fact table rows are numerical, additive figures such as quantity or cost per unit, etc.

Dimension Normalization

Dimensional normalization or snowflaking removes redundant attributes, which are known in the normal flatten de-normalized dimensions. Dimensions are strictly joined together in sub dimensions.

Snowflaking has an influence on the data structure that differs from many philosophies of data warehouses. Single data (fact) table surrounded by multiple descriptive (dimension) tables

Developers often don't normalize dimensions due to several facts:

  1. Normalization makes the data structure more complex
  2. Performance can be slower, due to the many joins between tables
  3. The space savings are minimal
  4. The use of bitmap indexes can't be done
  5. Query Performance, 3NF databases suffer from performance problems when aggregating or retrieving many dimensional values that may require analysis. If you are only going to do operational reports then you may be able to get by with 3NF because your operational user will be looking for very fine grain data.

There are some arguments on why normalization can be useful. It can be an advantage when part of hierarchy is common to more than one dimension. For example, a geographic dimension may be reusable because both the customer and supplier dimensions use it.

Benefits of dimensional modeling

Benefits of the dimensional modeling are following:

  • Understandability - Compared to the normalized model, the dimensional model is easier to understand and more intuitive. In dimensional models, information is grouped into coherent business categories or dimensions, making it easier to read and interpret. Simplicity also allows software to navigate databases efficiently. In normalized models, data is divided into many discrete entities and even a simple business process might result in dozens of tables joined together in a complex way.
  • Query performance - Dimensional models are more denormalized and optimized for data querying, while normalized models seek to eliminate data redundancies and are optimized for transaction loading and updating. The predictable framework of a dimensional model allows the database to make strong assumptions about the data that aid in performance. Each dimension is an equivalent entry point into the fact table, and this symmetrical structure allows effective handling of complex queries. Query optimization for star join databases is simple, predictable, and controllable.
  • Extensibility - Dimensional models are extensible and easily accommodate unexpected new data. Existing tables can be changed in place either by simply adding new data rows into the table or executing SQL alter table commands. No queries or other applications that sit on top of the Warehouse need to be reprogrammed to accommodate changes. Old queries and applications continue to run without yielding different results. But in normalized models each modification should be considered carefully, because of the complex dependencies between database tables.

To build a dimensional database, you start with a dimensional data model. The dimensional data model provides a method for making databases simple and understandable. You can conceive of a dimensional database as a database cube of three or four dimensions where users can access a slice of the database along any of its dimensions. To create a dimensional database, you need a model that lets you visualize the data.

Suppose your business sells products in different markets and evaluates the performance over time. It is easy to conceive of this business process as a cube of data, which contains dimensions for time, products, and markets. Figure 35 shows this dimensional model. The various intersections along the lines of the cube would contain the measures of the business. The measures correspond to a particular combination of product, market, and time data.

Figure 35. A Dimensional Model of a Business That Has Time, Product, and Market Dimensions
begin figure description - A cube is shown. The three axes of the cube (x, y, and z) are marked as

Another name for the dimensional model is the star-join schema. The database designers use this name because the diagram for this model looks like a star with one central table around which a set of other tables are displayed. The central table is the only table in the schema with multiple joins connecting it to all the other tables. This central table is called the fact table and the other tables are called dimension tables. The dimension tables all have only a single join that attaches them to the fact table, regardless of the query. Figure 36 shows a simple dimensional model of a business that sells products in different markets and evaluates business performance over time.

Figure 36. A Typical Dimensional Model
begin figure description - There are four tables represented by rectangles. Each rectangle contains a list of the column names that are part of that table. Lines connect columns between tables where there is a join. The
 

To build a dimensional data model, you need a methodology that outlines the decisions you need to make to complete the database design. This methodology uses a top-down approach because it first identifies the major processes in your organization where data is collected. An important task of the database designer is to start with the existing sources of data that your organization uses. After the processes are identified, one or more fact tables are built from each business process. The following steps describe the methodology you use to build the data model.

To build a dimensional database
  1. Choose the business processes that you want to use to analyze the subject area to be modeled.
  2. Determine the granularity of the fact tables.
  3. Identify dimensions and hierarchies for each fact table.
  4. Identify measures for the fact tables.
  5. Determine the attributes for each dimension table.
  6. Get users to verify the data model.

Although a dimensional database can be based on multiple business processes and can contain many fact tables, the data model that this section describes is based on a single business process and has one fact table.

Choosing a Business Process

A business process is an important operation in your organization that some legacy system supports. You collect data from this system to use in your dimensional database. The business process identifies what end users are doing with their data, where the data comes from, and how to transform that data to make it meaningful. The information can come from many sources, including finance, sales analysis, market analysis, customer profiles. The following list shows different business processes you might use to determine what data to include in your dimensional database:

  • Sales
  • Shipments
  • Inventory
  • Orders
  • Invoices

 

Summary of a Business Process

Suppose your organization wants to analyze customer buying trends by product line and region so that you can develop more effective marketing strategies. In this scenario, the subject area for your data model is sales.

After many interviews and thorough analysis of your sales business process, suppose your organization collects the following information:

  • Customer-base information has changed.

    Previously, sales districts were divided by city. Now the customer base corresponds to two regions: Region 1 for California and Region 2 for all other states.

  • The following reports are most critical to marketing:
    • Monthly revenue, cost, net profit by product line per vendor
    • Revenue and units sold by product, by region, by month
    • Monthly customer revenue
    • Quarterly revenue per vendor
  • Most sales analysis is based on monthly results, but you can choose to analyze sales by week or accounting period (at a later date).
  • A data-entry system exists in a relational database.

    To develop a working data model, you can assume that the relational database of sales information has the following properties:

    • The stores_demo database provides much of the revenue data that the marketing department uses.
    • The product code that analysts use is stored in the catalog table as the catalog number.
    • The product line code is stored in the stock table as the stock number. The product line name is stored as description.
    • The product hierarchies are somewhat complicated. Each product line has many products, and each manufacturer has many products.
  • All the cost data for each product is stored in a flat file named costs.lst on a different purchasing system.
  • Customer data is stored in the stores_demo database.

    The region information has not yet been added to the database.

An important characteristic of the dimensional model is that it uses business labels familiar to end users rather than internal tables or column names. After the business process is completed, you should have all the information you need to create the measures, dimensions, and relationships for the dimensional data model. This dimensional data model is used to implement the sales_demo database that Implementing a Dimensional Database (XPS) describes.

The stores_demo demonstration database is the primary data source for the dimensional data model that this chapter develops. For detailed information about the data sources that are used to populate the tables of the sales_demo database,

Determining the Granularity of the Fact Table

After you gather all the relevant information about the subject area, the next step in the design process is to determine the granularity of the fact table. To do this you must decide what an individual low-level record in the fact table should contain. The components that make up the granularity of the fact table correspond directly with the dimensions of the data model. Thus, when you define the granularity of the fact table, you identify the dimensions of the data model.

How Granularity Affects the Size of the Database

The granularity of the fact table also determines how much storage space the database requires. For example, consider the following possible granularities for a fact table:

  • Product by day by region
  • Product by month by region

The size of a database that has a granularity of product by day by region would be much greater than a database with a granularity of product by month by region because the database contains records for every transaction made each day as opposed to a monthly summation of the transactions. You must carefully determine the granularity of your fact table because too fine a granularity could result in an astronomically large database. Conversely, too coarse a granularity could mean the data is not detailed enough for users to perform meaningful queries against the database.

Using the Business Process to Determine the Granularity

A careful review of the information gathered from the business process should provide what you need to determine the granularity of the fact table. To summarize, your organization wants to analyze customer-buying trends by product line and region so that you can develop more effective marketing strategies.

Customer by Product

The granularity of the fact table always represents the lowest level for each corresponding dimension. When you review the information from the business process, the granularity for customer and product dimensions of the fact table are apparent. Customer and product cannot be reasonably reduced any further: they already express the lowest level of an individual record for the fact table. (In some cases, product might be further reduced to the level of product component because a product could be made up of multiple components.)

Customer by Product by District

Because the customer buying trends your organization wants to analyze include a geographical component, you still need to decide the lowest level for region information. The business process indicates that in the past, sales districts were divided by city, but now your organization distinguishes between two regions for the customer base: Region 1 for California and Region 2 for all other states. Nonetheless, at the lowest level, your organization still includes sales district data, so district represents the lowest level for geographical information and provides a third component to further define the granularity of the fact table.

Customer by Product by District by Day

Customer-buying trends always occur over time, so the granularity of the fact table must include a time component. Suppose your organization decides to create reports by week, accounting period, month, quarter, or year. At the lowest level, you probably want to choose a base granularity of day. This granularity allows your business to compare sales on Tuesdays with sales on Fridays, compare sales for the first day of each month, and so forth. The granularity of the fact table is now complete.

The decision to choose a granularity of day means that each record in the time dimension table represents a day. In terms of the storage requirements, even 10 years of daily data is only about 3,650 records, which is a relatively small dimension table.

Identifying the Dimensions and Hierarchies

After you determine the granularity of the fact table, it is easy to identify the primary dimensions for the data model because each component that defines the granularity corresponds to a dimension. Figure 39 shows the relationship between the granularity of the fact table and the dimensions of the data model.

Figure 39. The Granularity of the Fact Table Corresponds to the Dimensions of the Data Model
begin figure description - The diagram shows that if the granularity of the fact table is

With the dimensions (customer, product, geography, time) for the data model in place, the schema diagram begins to take shape.

Tip:At this point, you can add additional dimensions to the primary granularity of the fact table, where the new dimensions take on only a single value under each combination of the primary dimensions. If you see that an additional dimension violates the granularity because it causes additional records to be generated, then you must revise the granularity of the fact table to accommodate the additional dimension. For this data model, no additional dimensions need to be added.

You can now map out dimension elements and hierarchies for each dimension. Figure 40 shows the relationships among dimensions, dimension elements, and the inherent hierarchies.

Figure 40. The Relationships Between Dimensions, Dimension Elements, and the Inherent Hierarchies
begin figure description - The dimension elements for the product dimension are product, product line, and vendor. Product has a roll-up hierarchical relationship with product line and with vendor. Product has an attribute of product name. Product line has an attribute of product line name. Vendor has an attribute of vendor. The dimension element for the customer dimension is customer, which has attributes of customer, name, and company. The dimension elements for the geography dimension are district, state, and region. District has a roll-up hierarchical relationship with state, which has a roll-up hierarchical relationship with region. District has an attribute of district name. State has an attribute of state name. The dimensional elements for the time dimension are day, month, quarter, and year. Day has a roll-up hierarchical relationship with month, which has a roll-up hierarchical relationship with quarter, which has a roll-up hierarchical relationship with year. Day has an attribute of order date. - end figure description

In most cases, the dimension elements need to express the lowest possible granularity for each dimension, not because queries need to access individual low-level records, but because queries need to cut through the database in precise ways. In other words, even though the questions that a data warehousing environment poses are usually broad, these questions still depend on the lowest level of product detail.

Choosing the Measures for the Fact Table

The measures for the data model include not only the data itself, but also new values that you calculate from the existing data. When you examine the measures, you might discover that you need to make adjustments either in the granularity of the fact table or the number of dimensions.

Another important decision you must make when you design the data model is whether to store the calculated results in the fact table or to derive these values at runtime.

The question to answer is, "What measures are used to analyze the business?" Remember that the measures are the quantitative or factual data that tell how much or how many. The information that you gather from analysis of the sales business process results in the following list of measures:

  • Revenue
  • Cost
  • Units sold
  • Net profit

Use these measures to complete the fact table in Figure 41.

Figure 41. The Sales Fact Table References Each Dimension Table
begin figure description - The

Using Keys to Join the Fact Table with the Dimension Tables

Assume, for the moment, that the schema of Figure 41 shows both the logical and physical design of the database. The database contains the following five tables:

  • Sales fact table
  • Product dimension table
  • Time dimension table
  • Customer dimension table
  • Geography dimension table

Each of the dimensional tables includes a primary key (product, time_code, customer, district_code), and the corresponding columns in the fact table are foreign keys. The fact table also has a primary (composite) key that is a combination of these four foreign keys. As a rule, each foreign key of the fact table must have its counterpart in a dimension table. Furthermore, any table in a dimensional database that has a composite key must be a fact table, which means that every table in a dimensional database that expresses a many-to-many relationship is a fact table.

Tip:The primary key should be a short numeric data type (INT, SMALLINT, SERIAL) or a short character string (as used for codes). Do not use long character strings as primary keys.
 
 
 For Support