Certified Data Mining and Warehousing Professional Indexing B Tree Clustered etc

Indexing B Tree Clustered etc
 


Indexing a data warehouse is tricky. If you have too few indexes, the data loads quickly but the query response is slow. If you have too many indexes, the data loads slowly and your storage requirements go through the roof but the query response is good. Indexing in any database, transactional or warehouse, most often reduces the length of time it takes to see query results. This is especially true with large tables and complex queries that involve table joins.

Some of the variables that you’ll want to take into account when indexing the data warehouse are the type of data warehouse you have (i.e., primarily archive or primarily near real-time), how large the dimensions and fact tables are (and if the fact tables are partitioned), who will be accessing the data and how they'll do so, and whether access will be ad hoc or via structured application interfaces. These variables will determine how your indexing scheme should be structured. Here’s a simple plan for indexing the relational tables that comprise a portion of your data warehouse. (Although I only explain how to index dimensions and fact tables in this article, I explain how to index the staging database in the Web-exclusive sidebar “Indexing the Staging Database,” www.sqlmag.com, InstantDoc ID 99494.) Note that the relational tables are those that are managed by SQL Server’s relational data engine, not those managed by the SQL Server Analysis Services (SSAS) engine.

Indexing Dimensions
You’ll want to index the dimension key (primary key), which is a surrogate key, not a “natural” or transactional key such as customer name or customer ID. Note that you shouldn’t cluster on the dimension key.

The dimension will contain a natural or transactional key (e.g., a transaction number or identifier), which we’ll call a business key, from the source system. Although the business key might not be unique—as in the case of a type 2 response to slowly changing dimensions—create a clustered index on the identity column. The Customer and the Product dimensions have a clustered index built on the business key. By clustering on this key, you can enhance the query response when the business key is used in the WHERE clause. The expression in the WHERE clause is often used to search the dimensional data, and having the dimension records pre-sorted makes the query response faster.

Clustering by the business key might also help you avoid lock escalation (i.e., row to table, intent-exclusive to exclusive) during the extraction, transformation, and loading (ETL) process, which could happen if the surrogate key was the cluster key and all the rows were being added at the end of the file. If the exclusive locks escalated from row to table, that would block read, other ETL load, or utility operations on the table and cause diminished query response and even application timeouts. If your users and applications can live with some latency, you can get around this problem by having them query database snapshots and reserving the original database exclusively for data loads.

In Figure 1, the Date dimension and the Time dimension have no external datasource or business key. Instead of creating an identity-style primary key for these two tables, consider using a smart key, with a YYYYMMDD format for Date and an HHMMSSSSS format for Time (you can use fewer second positions, depending on how fine a time granularity you need to measure), and clustering on it. The values will maintain index order, range queries will be simplified in the fact table, and you’ll need one less join when querying because the primary key will contain the date (or time).

For large type 2 slowly changing dimensions (i.e., where you add a new row to record the change), you might want to create a four-part non-clustered index that includes the business key, the record begin date, the record end date, and the surrogate key. For efficiency and to prevent escalating storage requirements, INCLUDE the record end date and the surrogate key when creating the index instead of making them part of the index key, as shown in the following command:

CREATE NONCLUSTERED INDEX MyDim_CoveringIndex
  ON (NaturalKEY, RecordStartDate)
INCLUDE ( RecordEndDate, SurrogateKEY);

This command creates a covering index that can be useful during the ETL process and load operations and for historical queries. When you make RecordEnd- Date and SurrogateKEY INCLUDEs instead of part of the index key, the SQL Server engine stores these two values at only the leaf level of the index tree, thus reducing the storage requirements. By having these two columns in the index (i.e., creating a covering index), the SQL Server relational engine can get the data that it needs solely from the index during load and some query operations, without having to access data from the underlying dimension.

If there are other columns in the dimension that will be used continuously for searching, sorting, or grouping, create non-clustered indexes on those columns as you would in a transactional database. If there’s an embedded hierarchy in a dimension, such as the Category- SubCategory-ProductID hierarchy in the Product dimension, then consider indexing the components of the hierarchy if it will enhance query performance and won’t inhibit data loading.

Indexing the Fact Table
Indexing the fact table is similar to indexing a dimension, although you must account for partitioning. You’ll want to index and cluster on the date key or a combined date plus time. Because business intelligence (BI) analysis always seems to involve a date/time component, the fact table will have a date (or datetime) key, and clustering on this key will help with cube-building. Also, if the data records are already stored in date or datetime order, historical queries will have an execution advantage. If the fact table has more than one date or datetime column, cluster on the column that’s used most often for querying or cube-building.

If the fact table is partitioned on the date column, use that column as the clustering key. When you see the same column to create the clustered index that you used to create the partitions and creating the index in the same file group that holds the partitioned fact table, SQL Server will automatically partition the index the same way that the fact table is partitioned (i.e., the index will have the same partitioning function and column as the fact table). When the index is partitioned the same way the fact table is partitioned, the table and its index are said to be aligned, which makes for an optimal operational situation, especially if you anticipate creating additional partitions or making frequent partition switches.

Next, create a non-clustered index on each of the foreign keys in the fact table, and consider combining the foreign key and the date key, in that order, similar to CustomerKEY + DateKEY in Figure 1. Creating a non-clustered key on the foreign keys works especially well if one or more of the associated dimensions is a type 2 slowly changing dimension. Rows with the same foreign key value will be searched in ascending date order, which will enhance the historical query response. Note that you’ll want to retain relational integrity when dealing with the foreign keys .

Modifying Your Indexing Scheme
Over time, your data warehouse will change to accommodate what’s happening in your organization, and you’ll have to modify your indexing scheme. Most data warehouse/BI systems will access these relational tables directly, so you can use tried-and-true transactional methods for tuning indexes, such as evaluating the query and data mix and adjusting it accordingly. If your relational data warehouse is used only to stage SSAS structures, then you might not need any more indexes than those we’ve talked about. SSAS tends to use the same queries over and over again, so you can run the Index Tuning Wizard and tune exactly for that set of queries. Start simple, evaluate thoroughly, and build conservatively when indexing your data warehouse.

 

 

Factors used to determine indexing technique

a)  Characteristics of indexed column
A column has its own characteristics which we can use to choose a proper index.  These characteristics
are given below:


·  Cardinality data: The cardinality data of a column is the number of distinct values in the column.  It
is better to know that the cardinality of an indexed column is low or high since an indexing technique
may work efficiently only with either low cardinality or high cardinality.  
·  Distribution: The distribution of a column is the occurrence frequency of each distinct value of the
column.  The column distribution guides us to determine which index type we should take.    
·  Value range: The range of values of and indexed column guides us to select an appropriate index
type. For example, if the range of a high cardinality column is small, an indexing technique based on
bitmap should be used. Without knowing this information, we might use a B-Tree resulting in a
degradation of system performance.

b)  Understanding the data and the usage in the SQL language
Knowing the columns that will be queried helps us choose appropriate index types for them.  For
example, which columns will likely be a part of the selection list, join constraints, application constraints,
the ORDER BY clause, or the GROUP BY clause?    

Developing a new indexing technique for data warehouse’s queries.
The following are the characteristics that we have to concern with when developing a new indexing
technique:
a)  The index should be small and utilize space efficiently.
b)  The index should be able to operate with other indexes to filter out the records before accessing
raw data.
c)  The index should support ad hoc and complex queries and speed up join operations.
d)  The index should be easy to build (easily dynamically generate), implement and maintain.
 
Evaluation of Existing Indexing Techniques in Data Warehouses  
In data warehouse systems, there are many indexing techniques.  Each existing indexing technique is
suitable for a particular situation.  In this section we describe several indexing techniques being
studied/used in both academic research and industrial applications.  

The B-Tree Index
The B-Tree Index is the default index for most relational database systems. The top most
level of the index is called the root. The lowest level is called the leaf node.  All other levels in between
are called branches. Both the root and branch contain entries that point to the next level in the index.  
Leaf nodes consisting of the index key and pointers pointing to the physical location (i.e., row ids) in
which the corresponding records are stored.

The B-Tree Index is popular in data warehouse applications for high cardinality column such as names
since the space usage of the index is independent of the column cardinality.  However, the B-Tree Index
has characteristics that make them a poor choice for DW’s queries.  First of all, a B-Tree index is of no
value for low cardinality data such as the gender column since it reduces very few numbers of I/Os and
may use more space than the raw indexed column.  Secondly, each B-Tree Index is independent and
thus cannot operate with each other on an index level before going to the primary source.  Finally, the
B-Tree Index fetches the result data ordered by the key values which have unordered row ids, so more
I/O operations and page faults are generated.    

Projection Index
A Projection Index on an indexed column A in a table T stores all values of A in the same order as they
appear in T.  Each row of the Projection Index stores one value of A. The row order of value x in the index is the same as the row order of value x in T. Normally, the queries against a data warehouse retrieve only a few of the table’s columns; so having the Projection Index on these columns reduces tremendously the cost of querying because a single I/O operation may bring more values into memory.  Sybase builds a
Projection Index under the name of FastProjection Index on every column of a table.  

Bitmap Index
The bitmap representation is an alternate method of the row ids representation.  It is simple to represent,
and uses less space- and CPU-efficient than row ids when the number of distinct values of the indexed
column is low.  The indexes improve complex query performance by applying low-cost Boolean
operations such as OR, AND, and NOT in the selection predicate on multiple indexes at one time to
reduce search space before  going to the primary source data.  Many variations of the Bitmap Index
(Pure Bitmap Index,  Encoded Bitmap, etc.) have been introduced, aiming to reduce space requirement
as well as improve query performance.

a) Pure Bitmap Index : Pure Bitmap Index was first introduced and implemented in the
Model 204 DBMS.  It consists of a collect of bitmap vectors each of which is created to represent each
distinct value of the indexed column.  A bit i in a bitmap vector, representing value x, is set to 1 if the
record i in the indexed table contains x.  Figure 3 shows an example of the Pure Bitmap Index on the
package_type column of the PRODUCT table.  The Pure Bitmap Index on this column is the collection of 12 bitmap vectors, says {BA, BB, BC, BD, BE, BF, BG, BH, BI, BJ, BK, and BL}, one for each package type. To answer a query, the bitmap vectors of the values specified in the predicate condition are read into memory.  If there are more than one bitmap vectors read, a Boolean operation will be performed on them before accessing data.  However, the sparsity problem occurs if the Pure Index is built on high cardinality column which then requires more space and query processing time to build and answer a query.  Most of commercial data warehouse products (e.g., Oracle, Sybase, Informix, Red Brick, etc.) implement the Pure Bitmap Index.
 
Encoded Bitmap Index : An Encoded Bitmap Index on a column A of a table T consists of a set of bitmap vectors, a lookup table, and a set of retrieval Boolean functions. Each distinct value of a column A is encoded using a number of bits each of which is stored in a bitmap vector. The lookup table stores the mapping between A and its encoded representation. IBM implements this index in DB2. Comparing with the Pure Bitmap Index, the Encoded Bitmap Index improves the space utilization, and solves sparsity problems.  The size of the Encoded Bitmap Index built on the high cardinality column is less than the Pure Bitmap Index.  Having a well defined encoding scheme, a Boolean operation can perform on the retrieval functions before retrieving the data, and lead to a reduction of the number of bitmap vectors read.  Its performance is degraded with equality queries since we have to search all the bitmap vectors.  The index needs to be rebuilt if we run out of bits to represent new values.

Join Index   
A Join Index is built by translating restrictions on the column value of a dimension table (i.e., the gender
column) to restrictions on a large fact table.  The index is implemented using one of the two representations: row id   or bitmap , depending on the cardinality of the indexed column.  A bitmap representation, which is called Bitmap Join Index, is used with the low cardinality data while a row id representation is used with a high cardinality.  In DW, there are many join operations involved; so building Join Indexes on the joining columns improves query-processing time.  
For example, Bitmap Join Indexes on the gender column in the SALE table can be built by using the gender column in the CUSTOMER table and the foreign key customer id in the SALES table. Note that the Sales table does not contain the gender column. The Bitmap Join Index for gender equal to male is created by setting a bit corresponding to a row for customer_id whose gender is ‘M’ to 1 in the Sales
 
If a bitmap vector is built by translating restrictions on the column values from several joined tables at once (e.g. gender and product type in the different dimension tables) then it is called a Multiple Bitmap Join Index.  


Using Bitmap Indexes in Data Warehouses

Bitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:

  • Reduced response time for large classes of ad hoc queries.

  • Reduced storage requirements compared to other indexing techniques.

  • Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory.

  • Efficient maintenance during parallel DML and loads.

Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of disk space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table.

An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids.

Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. Bitmap indexes store the bitmaps in a compressed way. If the number of distinct key values is small, bitmap indexes compress better and the space saving benefit compared to a B-tree index becomes even better.

Bitmap indexes are most effective for queries that contain multiple conditions in the WHEREclause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. If you are unsure of which indexes to create, the SQL Access Advisor can generate recommendations on what to create. As the bitmaps from bitmap indexes can be combined quickly, it is usually best to use single-column bitmap indexes.

When creating bitmap indexes, you should use NOLOGGINGand COMPUTESTATISTICS. In addition, you should keep in mind that bitmap indexes are usually easier to destroy and re-create than to maintain.

Benefits for Data Warehousing Applications

Bitmap indexes are primarily intended for data warehousing applications where users query the data rather than update it. They are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data.

Parallel query and parallel DML work with bitmap indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes.

Cardinality

The advantages of using bitmap indexes are greatest for columns in which the ratio of the number of distinct values to the number of rows in the table is small. We refer to this ratio as the degree of cardinality. A gender column, which has only two distinct values (male and female), is optimal for a bitmap index. However, data warehouse administrators also build bitmap indexes on columns with higher cardinalities.

For example, on a table with one million rows, a column with 10,000 distinct values is a candidate for a bitmap index. A bitmap index on this column can outperform a B-tree index, particularly when this column is often queried in conjunction with other indexed columns. In fact, in a typical data warehouse environments, a bitmap index can be considered for any non-unique column.

B-tree indexes are most effective for high-cardinality data: that is, for data with many possible values, such as customer_nameor phone_number. In a data warehouse, B-tree indexes should be used only for unique columns or other columns with very high cardinalities (that is, columns that are almost unique). The majority of indexes in a data warehouse should be bitmap indexes.

In ad hoc queries and similar situations, bitmap indexes can dramatically improve query performance. ANDand ORconditions in the WHEREclause of a query can be resolved quickly by performing the corresponding Boolean operations directly on the bitmaps before converting the resulting bitmap to rowids. If the resulting number of rows is small, the query can be answered quickly without resorting to a full table scan.

Example 6-1 Bitmap Index

The following shows a portion of a company's customerstable.

SELECT cust_id, cust_gender, cust_marital_status, cust_income_level
FROM customers;

CUST_ID    C CUST_MARITAL_STATUS  CUST_INCOME_LEVEL
---------- - -------------------- ---------------------
... 
        70 F                      D: 70,000 - 89,999
        80 F married              H: 150,000 - 169,999
        90 M single               H: 150,000 - 169,999
       100 F                      I: 170,000 - 189,999
       110 F married              C: 50,000 - 69,999
       120 M single               F: 110,000 - 129,999
       130 M                      J: 190,000 - 249,999
       140 M married              G: 130,000 - 149,999
...

Because cust_gender, cust_marital_status, and cust_income_levelare all low-cardinality columns (there are only three possible values for marital status, two possible values for gender, and 12 for income level), bitmap indexes are ideal for these columns. Do not create a bitmap index on cust_idbecause this is a unique column. Instead, a unique B-tree index on this column provides the most efficient representation and retrieval.

Table 6-1 illustrates the bitmap index for the cust_gendercolumn in this example. It consists of two separate bitmaps, one for gender.

Table 6-1 Sample Bitmap Index

  gender='M' gender='F'

cust_id 70

0

1

cust_id 80

0

1

cust_id 90

1

0

cust_id 100

0

1

cust_id 110

0

1

cust_id 120

1

0

cust_id 130

1

0

cust_id 140

1

0

Each entry (or bit) in the bitmap corresponds to a single row of the customerstable. The value of each bit depends upon the values of the corresponding row in the table. For example, the bitmap cust_gender='F'contains a one as its first bit because the gender is Fin the first row of the customerstable. The bitmap cust_gender='F'has a zero for its third bit because the gender of the third row is not F.

An analyst investigating demographic trends of the company's customers might ask, "How many of our married customers have an income level of G or H?" This corresponds to the following query:

SELECT COUNT(*) FROM customers
WHERE cust_marital_status = 'married' 
AND cust_income_level IN ('H: 150,000 - 169,999', 'G: 130,000 - 149,999');

Bitmap indexes can efficiently process this query by merely counting the number of ones in the bitmap illustrated in Figure 6-1. The result set will be found by using bitmap ORmerge operations without the necessity of a conversion to rowids. To identify additional specific customer attributes that satisfy the criteria, use the resulting bitmap to access the table after a bitmap to rowid conversion.

Figure 6-1 Executing a Query Using Bitmap Indexes

Description of Figure 6-1 follows
Description of "Figure 6-1 Executing a Query Using Bitmap Indexes"
 

How to Determine Candidates for Using a Bitmap Index

Bitmap indexes should help when either the fact table is queried alone, and there are predicates on the indexed column, or when the fact table is joined with two or more dimension tables, and there are indexes on foreign key columns in the fact table, and predicates on dimension table columns.

A fact table column is a candidate for a bitmap index when the following conditions are met:

  • There are 100 or more rows for each distinct value in the indexed column. When this limit is met, the bitmap index will be much smaller than a regular index, and you will be able to create the index much faster than a regular index. An example would be one million distinct values in a multi-billion row table.

And either of the following are true:

  • The indexed column will be restricted in queries (referenced in the WHEREclause).

or

  • The indexed column is a foreign key for a dimension table. In this case, such an index will make star transformation more likely.

Bitmap Indexes and Nulls

Unlike most other types of indexes, bitmap indexes include rows that have NULLvalues. Indexing of nulls can be useful for some types of SQL statements, such as queries with the aggregate function COUNT.

Example 6-2 Bitmap Index

SELECT COUNT(*) FROM customers WHERE cust_marital_status IS NULL;

This query uses a bitmap index on cust_marital_status. Note that this query would not be able to use a B-tree index, because B-tree indexes do not store the NULLvalues.

SELECT COUNT(*) FROM customers;

Any bitmap index can be used for this query because all table rows are indexed, including those that have NULLdata. If nulls were not indexed, the optimizer would be able to use indexes only on columns with NOT NULLconstraints.

Bitmap Indexes on Partitioned Tables

You can create bitmap indexes on partitioned tables but they must be local to the partitioned table—they cannot be global indexes. A partitioned table can only have global B-tree indexes, partitioned or non-partitioned. 

Using Bitmap Join Indexes in Data Warehouses

In addition to a bitmap index on a single table, you can create a bitmap join index, which is a bitmap index for the join of two or more tables. In a bitmap join index, the bitmap for the table to be indexed is built for values coming from the joined tables. In a data warehousing environment, the join condition is an equi-inner join between the primary key column or columns of the dimension tables and the foreign key column or columns in the fact table.

A bitmap join index can improve the performance by an order of magnitude. By storing the result of a join, the join can be avoided completely for SQL statements using a bitmap join index. Furthermore, since it is most likely to have a much smaller number of distinct values for a bitmap join index compared to a regular bitmap index on the join column, the bitmaps compress better, yielding to less space consumption than a regular bitmap index on the join column.

Bitmap join indexes are much more efficient in storage than materialized join views, an alternative for materializing joins in advance. This is because the materialized join views do not compress the rowids of the fact tables.

Four Join Models for Bitmap Join Indexes

The most common usage of a bitmap join index is in star model environments, where a large table is indexed on columns joined by one or several smaller tables. We will refer to the large table as the fact table and to the smaller tables as dimension tables. The following section describes the four different join models supported by bitmap join indexes. 

Example 6-3 Bitmap Join Index: One Dimension Table Columns Joins One Fact Table

Unlike the example in "Bitmap Index", where a bitmap index on the cust_gendercolumn on the customerstable was built, we now create a bitmap join index on the fact table salesfor the joined column customers(cust_gender). Table salesstores cust_idvalues only:

SELECT time_id, cust_id, amount_sold FROM sales;

TIME_ID   CUST_ID    AMOUNT_SOLD
--------- ---------- -----------
01-JAN-98      29700        2291
01-JAN-98       3380         114
01-JAN-98      67830         553
01-JAN-98     179330           0
01-JAN-98     127520         195
01-JAN-98      33030         280
...

To create such a bitmap join index, column customers(cust_gender)has to be joined with table sales. The join condition is specified as part of the CREATEstatement for the bitmap join index as follows:

CREATE BITMAP INDEX sales_cust_gender_bjix
ON sales(customers.cust_gender)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL NOLOGGING COMPUTE STATISTICS;

The following query shows illustrates the join result that is used to create the bitmaps that are stored in the bitmap join index:

SELECT sales.time_id, customers.cust_gender, sales.amount_sold
FROM sales, customers
WHERE sales.cust_id = customers.cust_id;

TIME_ID   C AMOUNT_SOLD
--------- - -----------
01-JAN-98 M        2291
01-JAN-98 F         114
01-JAN-98 M         553
01-JAN-98 M           0
01-JAN-98 M         195
01-JAN-98 M         280
01-JAN-98 M          32
...

Table 6-2 illustrates the bitmap representation for the bitmap join index in this example.

Table 6-2 Sample Bitmap Join Index

  cust_gender='M' cust_gender='F'

sales record 1

1

0

sales record 2

0

1

sales record 3

1

0

sales record 4

1

0

sales record 5

1

0

sales record 6

1

0

sales record 7

1

0

You can create other bitmap join indexes using more than one column or more than one table, as shown in these examples.

Example 6-4 Bitmap Join Index: Multiple Dimension Columns Join One Fact Table

You can create a bitmap join index on more than one column from a single dimension table, as in the following example, which uses customers(cust_gender, cust_marital_status)from the shschema:

CREATE BITMAP INDEX sales_cust_gender_ms_bjix
ON sales(customers.cust_gender, customers.cust_marital_status)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL NOLOGGING COMPUTE STATISTICS;

Example 6-5 Bitmap Join Index: Multiple Dimension Tables Join One Fact Table

You can create a bitmap join index on multiple dimension tables, as in the following, which uses customers(gender)and products(category):

CREATE BITMAP INDEX sales_c_gender_p_cat_bjix
ON sales(customers.cust_gender, products.prod_category)
FROM sales, customers, products
WHERE sales.cust_id = customers.cust_id
AND sales.prod_id = products.prod_id
LOCAL NOLOGGING COMPUTE STATISTICS;

Example 6-6 Bitmap Join Index: Snowflake Schema

You can create a bitmap join index on more than one table, in which the indexed column is joined to the indexed table by using another table. For example, you can build an index on countries.country_name, even though the countriestable is not joined directly to the salestable. Instead, the countriestable is joined to the customerstable, which is joined to the salestable. This type of schema is commonly called a snowflake schema.

CREATE BITMAP INDEX sales_co_country_name_bjix
ON sales(countries.country_name)
FROM sales, customers, countries
WHERE sales.cust_id = customers.cust_id
  AND customers.country_id = countries.country_id
LOCAL NOLOGGING COMPUTE STATISTICS;

Bitmap Join Index Restrictions and Requirements

Join results must be stored, therefore, bitmap join indexes have the following restrictions:

  • Parallel DML is only supported on the fact table. Parallel DML on one of the participating dimension tables will mark the index as unusable.

  • Only one table can be updated concurrently by different transactions when using the bitmap join index.

  • No table can appear twice in the join.

  • You cannot create a bitmap join index on an index-organized table or a temporary table.

  • The columns in the index must all be columns of the dimension tables.

  • The dimension table join columns must be either primary key columns or have unique constraints.

  • The dimension table column(s) participating the join with the fact table must be either the primary key column(s) or with the unique constraint.

  • If a dimension table has composite primary key, each column in the primary key must be part of the join.

  • The restrictions for creating a regular bitmap index also apply to a bitmap join index. For example, you cannot create a bitmap index with the UNIQUEattribute. 

Using B-Tree Indexes in Data Warehouses

A B-tree index is organized like an upside-down tree. The bottom level of the index holds the actual data values and pointers to the corresponding rows, much as the index in a book has a page number associated with each index entry.

In general, use B-tree indexes when you know that your typical query refers to the indexed column and retrieves a few rows. In these queries, it is faster to find the rows by looking at the index. However, using the book index analogy, if you plan to look at every single topic in a book, you might not want to look in the index for the topic and then look up the page. It might be faster to read through every chapter in the book. Similarly, if you are retrieving most of the rows in a table, it might not make sense to look up the index to find the table rows. Instead, you might want to read or scan the table.

B-tree indexes are most commonly used in a data warehouse to enforce unique keys. In many cases, it may not even be necessary to index these columns in a data warehouse, because the uniqueness was enforced as part of the preceding ETL processing, and because typical data warehouse queries may not work better with such indexes. B-tree indexes are more common in environments using third normal form schemas. In general, bitmap indexes should be more common than B-tree indexes in most data warehouse environments.

Using Index Compression

Bitmap indexes are always stored in a patented, compressed manner without the need of any user intervention. B-tree indexes, however, can be stored specifically in a compressed manner to enable huge space savings, storing more keys in each index block, which also leads to less I/O and better performance.

Key compression lets you compress a B-tree index, which reduces the storage overhead of repeated values. In the case of a nonunique index, all index columns can be stored in a compressed format, whereas in the case of a unique index, at least one index column has to be stored uncompressed.

Generally, keys in an index have two pieces, a grouping piece and a unique piece. If the key is not defined to have a unique piece, Oracle provides one in the form of a rowid appended to the grouping piece. Key compression is a method of breaking off the grouping piece and storing it so it can be shared by multiple unique pieces. The cardinality of the chosen columns to be compressed determines the compression ratio that can be achieved. So, for example, if a unique index that consists of five columns provides the uniqueness mostly by the last two columns, it is most optimal to choose the three leading columns to be stored compressed. If you choose to compress four columns, the repetitiveness will be almost gone, and the compression ratio will be worse.

Although key compression reduces the storage requirements of an index, it can increase the CPU time required to reconstruct the key column values during an index scan. It also incurs some additional storage overhead, because every prefix entry has an overhead of four bytes associated with it.

Choosing Between Local Indexes and Global Indexes

B-tree indexes on partitioned tables can be global or local. With Oracle8i and earlier releases, Oracle recommended that global indexes not be used in data warehouse environments because a partition DDL statement (for example, ALTERTABLE... DROPPARTITION) would invalidate the entire index, and rebuilding the index is expensive. In Oracle Database 10g, global indexes can be maintained without Oracle marking them as unusable after DDL. This enhancement makes global indexes more effective for data warehouse environments.

However, local indexes will be more common than global indexes. Global indexes should be used when there is a specific requirement which cannot be met by local indexes (for example, a unique index on a non-partitioning key, or a performance requirement).

Bitmap indexes on partitioned tables are always local.

 For Support