Certified Data Mining and Warehousing Professional

Management and trends

Data warehousing and business intelligence metadata is best managed through a combination of people, process and tools.

The people side requires that staff be trained in the importance and use of metadata. They need to understand how and when to use the tools, as well as the benefits that metadata provides.

The process side incorporates metadata management into the data warehousing and business intelligence life cycle. As the life cycle progresses, metadata is entered into the appropriate tool and stored in a metadata repository for further use.

Metadata can be managed through individual tools:

  • Metadata manager / repository
  • Metadata extract tools
  • Data modeling
  • ETL
  • BI Reporting

Alternatively, metadata can be managed through a shared repository that combines information from multiple sources.

The metadata manager can be purchased as a software package or built as a "home-grown" system. Many organizations start with a spreadsheet containing data definitions and then grow to a more sophisticated approach.

Extracting Metadata from Input Sources

Metadata can be obtained through a manual process of keying in metadata or through automated processes. Scanners can extract metadata from text such as SQL DDL or COBOL programs. Other tools can directly access metadata through SQL catalogs and other metadata sources.

Picking the appropriate metadata extract tools is a key part of metadata management.

Many data modeling tools include a metadata extract capability, otherwise known as "reverse engineering". With this capability, database information about tables and columns can be extracted. The information can then be exported from the data modeling tool to the metadata manager.
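As a minimal sketch of accessing metadata directly through a SQL catalog, the example below reads table and column information from an in-memory SQLite database. SQLite, the customer table and the function name extract_table_metadata are assumptions chosen for illustration; other database systems expose their catalogs differently.

```python
import sqlite3


def extract_table_metadata(conn):
    """Return {table: [(column, declared_type, not_null), ...]} from the SQLite catalog."""
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            (name, col_type, bool(notnull))
            for _cid, name, col_type, notnull, _default, _pk in columns
        ]
    return metadata


# Hypothetical source table, used only to have something to scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER NOT NULL, name TEXT, region TEXT)")

catalog = extract_table_metadata(conn)
```

The resulting dictionary could then be exported to whatever metadata manager is in use, whether a spreadsheet or a repository.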


Meta-data management (also known as metadata management, without the hyphen) involves storing information about other information. With different types of media being used, references to the location of the data can allow management of diverse repositories.

Metadata management can be defined as the end-to-end process and governance framework for creating, controlling, enhancing, attributing, defining and managing a metadata schema, model or other structured aggregation system, either independently or within a repository and the associated supporting processes (often to enable the management of content).

URLs, images, video etc. may be referenced from a triples table of object, attribute and value.
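A triples table of this kind can be sketched as a simple list of (object, attribute, value) rows. The object names, attribute names and locations below are hypothetical and serve only to show how diverse media can be referenced in one structure.

```python
# Each row is one (object, attribute, value) triple. All values are illustrative.
triples = [
    ("report-42", "type", "image"),
    ("report-42", "location", "https://example.com/img/report-42.png"),
    ("intro-clip", "type", "video"),
    ("intro-clip", "location", "file:///media/intro.mp4"),
]


def describe(rows, obj):
    """Collect every attribute recorded for one object."""
    return {attr: value for o, attr, value in rows if o == obj}


def find_objects(rows, attribute, value):
    """Find the objects that carry a given attribute/value pair."""
    return sorted(o for o, a, v in rows if a == attribute and v == value)
```

Because every fact is a row, a new attribute can be recorded without changing the table structure.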

With specific knowledge domains, the boundaries of the metadata for each must be managed, since a general ontology is not useful to experts in one field whose language is knowledge-domain specific.

When building a knowledge management solution, creating a metadata schema and developing a system for managing metadata are very important. In such a project, a dedicated metadata manager may be appointed to maintain adherence to metadata and information management standards. This person is responsible for the metadata strategy and, possibly, its implementation. A metadata manager does not need to be involved in every aspect of the solution, but an understanding of as much of the process as possible helps ensure that a relevant schema is developed.

Managing the metadata in a knowledge management solution is an important step in a metadata strategy. Part of that strategy is to make sure that the metadata is complete, current and correct at any given time. Managing a metadata project also means making sure that users of the system are aware of the possibilities offered by a well-designed metadata system and know how to maximize the benefits of metadata. Regular monitoring of the metadata is advised to ensure that the schema remains relevant.

Metadata storage

Metadata can be stored either internally, in the same file as the data, or externally, in a separate file. Metadata that is embedded with content is called embedded metadata. A data repository typically stores the metadata detached from the data. Both ways have advantages and disadvantages:

  • Internal storage allows transferring metadata together with the data it describes; thus, metadata is always at hand and can be manipulated easily. However, this method creates high redundancy and does not allow the metadata for many resources to be held together in one place.
  • External storage allows bundling metadata, for example in a database, for more efficient searching. There is no redundancy and metadata can be transferred simultaneously when using streaming. However, as most formats use URIs to link metadata to its data, the linking method should be treated with care. What if a resource does not have a URI (resources on a local hard disk or web pages that are created on-the-fly using a content management system)? What if metadata can only be evaluated when there is a connection to the Web, especially when using RDF? How can one detect that a resource has been replaced by another with the same name but different content?

Moreover, there is the question of data format: storing metadata in a human-readable format such as XML can be useful because users can understand and edit it without specialized tools. On the other hand, these formats are not optimized for storage capacity; it may be useful to store metadata in a binary, non-human-readable format instead to speed up transfer and save memory.
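The trade-off between a human-readable and a binary format can be illustrated with a small sketch. The record fields and the binary layout below are assumptions made up for this example, not a standard format.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical metadata record.
record = {"title": "Quarterly sales", "creator": "Finance", "year": 2001}


def to_xml(rec):
    """Human-readable form: one XML element per field."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root)


def to_binary(rec):
    """Compact form (assumed layout): length-prefixed strings, then a 2-byte year."""
    title = rec["title"].encode("utf-8")
    creator = rec["creator"].encode("utf-8")
    return struct.pack(
        f"!B{len(title)}sB{len(creator)}sH",
        len(title), title, len(creator), creator, rec["year"],
    )


def from_binary(blob):
    """Decode the assumed layout back into a record."""
    title_len = blob[0]
    title = blob[1:1 + title_len].decode("utf-8")
    creator_len = blob[1 + title_len]
    start = 2 + title_len
    creator = blob[start:start + creator_len].decode("utf-8")
    (year,) = struct.unpack("!H", blob[start + creator_len:start + creator_len + 2])
    return {"title": title, "creator": creator, "year": year}


xml_bytes = to_xml(record)
binary_bytes = to_binary(record)
```

The XML form can be read and edited in any text editor, while the binary form is shorter but needs the decoding function and knowledge of the layout to be understood.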


Metadata functions

  • Resource discovery
    • Allowing resources to be found by relevant criteria;
    • Identifying resources;
    • Bringing similar resources together;
    • Distinguishing dissimilar resources;
    • Giving location information.
  • Organizing e-resources
    • Organizing links to resources based on audience or topic.
    • Building these pages dynamically from metadata stored in databases.
  • Facilitating interoperability
    • Using defined metadata schemes, shared transfer protocols, and crosswalks between schemes, resources across the network can be searched more seamlessly.
      • Cross-system search, e.g., using Z39.50 protocol;
      • Metadata harvesting, e.g., OAI protocol.
  • Digital identification
    • Elements for standard numbers, e.g., ISBN
    • The location of a digital object may also be given using:
      • a file name
      • a URL
      • some persistent identifiers, e.g., PURL (Persistent URL); DOI (Digital Object Identifier)
    • Combined metadata to act as a set of identifying data, differentiating one object from another for validation purposes.
  • Archiving and preservation
    • Challenges:
      • Digital information is fragile and can be corrupted or altered;
      • It may become unusable as storage technologies change.
    • Metadata is key to ensuring that resources will survive and continue to be accessible into the future. Archiving and preservation require special elements:
      • to track the lineage of a digital object,
      • to detail its physical characteristics, and
      • to document its behavior in order to emulate it in future technologies.
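A crosswalk between schemes, as mentioned under interoperability above, can be sketched as a field-name mapping. The source field names here are a hypothetical local scheme; the target names (title, creator, date, identifier) are elements of the Dublin Core element set.

```python
# Hypothetical local field names mapped to Dublin Core element names.
CROSSWALK = {
    "doc_name": "title",
    "author": "creator",
    "published": "date",
    "isbn": "identifier",
}


def apply_crosswalk(record, crosswalk):
    """Translate a record's field names; return (mapped, unmapped) so nothing is silently lost."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in crosswalk:
            mapped[crosswalk[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped


# Illustrative record with one field that has no counterpart in the target scheme.
local_record = {"doc_name": "Annual report", "author": "J. Smith", "shelf": "B12"}
mapped, unmapped = apply_crosswalk(local_record, CROSSWALK)
```

Returning the unmapped fields separately makes gaps in the crosswalk visible instead of dropping data without notice.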


Proliferation of Data Sources

The number of enterprise data sources is growing rapidly, with new types of sources emerging every year. The most exciting new source is, of course, enterprise e-business operations. Enterprises want to integrate clickstream data from their Web sites with other internal data in order to get a complete picture of their customers and integrate internal processes. Other sources for valuable data include ERP programs, operational data stores, packaged and homegrown analytic applications and existing data marts. The process of integrating these sources into one data warehouse can be complicated and is made even more difficult when an enterprise merges with or acquires another enterprise.

Enterprises also look to a growing number of external sources to supplement their internal data. These might include prospect lists, demographic and psychographic data, and business profiles purchased from third-party providers. Enterprises might also want to use an external provider for help with address verification, where internal company sources are compared with a master list to ensure data accuracy. Additionally, some industries have their own specific sources of external data. For example, the retail industry uses data from store scanners, and the pharmaceutical industry uses prescription data that is aggregated by third-party vendors.

Hub Versus Relational Databases

In an effort to control costs and improve performance, enterprises are increasingly implementing data hubs in their data warehouses instead of using relational databases. Keeping data in a relational database can be quite expensive, costing three to five times more than keeping data in a nonrelational repository. Additionally, queries on nonrelational data stores can outperform queries on relational databases. In hopes of achieving these benefits, enterprises are turning to compressed flat files to replace some of their RDBMSs. Despite the performance benefits and cost-effectiveness of these data hubs, they are limited by not having SQL and are not appropriate for all situations.

Active Data Warehouses

As enterprises face competitive pressure to increase the speed of decision making, the data warehouse must evolve to support real-time analysis and action. "Active" data warehouses are one way to meet this need. In contrast to traditional data warehouses, active data warehouses are tied closely to operational systems, are designed to hold very detailed and current data, and feature shortened batch windows. And unlike most operational data stores (ODS), active data warehouses hold integrated data and are open to user queries. All of the aforementioned characteristics make active data warehouses ideal for real-time analysis and decision-making as well as automated event triggering.

With an active data warehouse, an enterprise can respond to customer interactions and changing business conditions in real time. An active data warehouse enables a credit card company to detect and stop fraud as it happens, a transportation company to reroute its vehicles quickly and efficiently or an online retailer to communicate special offers based on a customer’s Web surfing behavior. The active data warehouse’s greatest benefit lies in its ability to support tactical as well as strategic decisions.

Growing Number of End Users

As vendors make data warehousing and business intelligence tools more accessible to the masses, the number of data warehousing end users is growing rapidly. Survey.com predicts that the number of data warehouse users will more than quadruple by 2002, with an average of 2,718 individual users and 609 concurrent users per warehouse. In addition to coping with the growth in the number of end users, data warehousing teams will need to cater to different types of end users. In a recent study, Gartner found that the use of business intelligence tools is growing most rapidly among administration and operations personnel, followed closely by executive-level personnel. These findings demonstrate that business intelligence tools have become both easier to use and more strategic. Obviously, end users will have different needs depending on their position in the company – while the business analyst needs ad hoc querying capabilities, the CEO and COO may only want static reporting.

Enterprises can handle the growing number of end users through the use of several techniques including parallelism and scalability, optimized data partitioning, aggregates, cached result sets and single-mission data marts. These techniques allow a large number of employees to concurrently access the data warehouse without compromising performance. Accommodating the different needs of various end-user groups will require as much of an organizational solution as a technical one. Data warehousing teams should involve end users from the beginning in order to determine the types of data and applications necessary to meet their decision-making needs.
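Two of the techniques above, aggregates and cached result sets, can be sketched with an in-memory SQLite database. The sales table, the sales_by_region summary table and the region_total function are hypothetical names used only for illustration.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 70.0)],
)

# Aggregate: summarize the detail rows once so that queries read the small summary table.
conn.execute(
    "CREATE TABLE sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)


@lru_cache(maxsize=128)
def region_total(region):
    """Cached result set: repeated requests for a region skip the database."""
    row = conn.execute(
        "SELECT total FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else 0.0
```

In a real warehouse the summary table and the cache would both need to be refreshed when new detail data is loaded, which this sketch leaves out.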

More Complex Queries

In addition to becoming more numerous, queries against the data warehouse will also become more complex. User expectations are growing in terms of the ability to get exactly the type of information needed, when it’s needed. Simple data aggregation is no longer enough to satisfy users who want to be able to drill down on multiple dimensions. For example, it may not be enough to deliver a regional sales report every week. Users may want to look at the data by customized dimensions – perhaps by a certain customer characteristic, a specific sales location or the time of purchase.
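Drilling down on customized dimensions can be sketched as re-aggregating the same detail rows by a different set of columns. The rows, dimension names and the rollup function below are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical detail rows; each key other than "amount" is a dimension.
sales = [
    {"region": "east", "store": "E1", "segment": "student", "amount": 20},
    {"region": "east", "store": "E1", "segment": "retiree", "amount": 30},
    {"region": "east", "store": "E2", "segment": "student", "amount": 10},
    {"region": "west", "store": "W1", "segment": "student", "amount": 40},
]


def rollup(rows, dimensions, measure="amount"):
    """Total the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_region = rollup(sales, ["region"])                   # the weekly regional report
by_region_store = rollup(sales, ["region", "store"])    # drill down to sales location
by_segment = rollup(sales, ["segment"])                 # slice by customer characteristic
```

The same function serves every report because the choice of dimensions is a parameter rather than being fixed in the query.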

Users are also demanding more sophisticated business intelligence tools. According to Gartner, data mining is the most rapidly growing business intelligence technology. Other sophisticated technologies are also becoming more popular. Vendors are developing software that can monitor data repositories and trigger reactions to events on a real-time basis. For example, if a telecom customer calls to cancel his call-waiting feature, real-time analytic software can detect this and trigger a special offer of a lower price in order to retain the customer. Vendors are also developing a new generation of data mining algorithms, featuring predictive power combined with explanatory components, robustness and self-learning features. These new algorithms automate data mining and make it more accessible to mainstream users by providing explanations with results, indicating when results are not reliable and automatically adapting to changes in underlying predictive models and/or data structures.
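The telecom example above can be sketched as a small rule table that pairs a condition with a reaction. The event fields, rule names and offer text are hypothetical; real products implement event monitoring in their own ways.

```python
def is_feature_cancellation(event):
    """Condition: the customer is cancelling a feature."""
    return event["type"] == "cancel_feature"


def make_retention_offer(event):
    """Reaction: propose a lower price for the feature being cancelled."""
    return f"Offer {event['customer']} a lower price on {event['feature']}"


# Each rule pairs a condition with the reaction it triggers.
RULES = [(is_feature_cancellation, make_retention_offer)]


def handle_event(event, rules=RULES):
    """Run every rule whose condition matches and collect the triggered reactions."""
    return [action(event) for matches, action in rules if matches(event)]


actions = handle_event(
    {"type": "cancel_feature", "customer": "C-1001", "feature": "call waiting"}
)
```

New reactions can be added by appending another (condition, reaction) pair to the rule table.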

Enterprises can handle complex queries and the demands of advanced analytic technologies by implementing some of the same techniques used to handle the increasing number of users, including parallelism. These techniques ensure that complex queries will not compromise data warehouse performance. In trying to meet end-user demands, enterprises will also need to address data warehouse availability. In global organizations, users need 24x7 uptime in order to get the information they need. In enterprises with moderate data volumes, high availability is easily implemented with high redundancy levels. In enterprises with large data volumes, however, systems must be carefully engineered for robustness through the use of well-designed parallel frameworks.

Exploding Data Volumes

One of the biggest technology issues facing enterprises today is the explosion in data volumes that is expected to occur over the next several years. According to Gartner, in 2004 enterprises will be managing 30 times more data than in 1999. And Survey.com found that the amount of usable data in the average data warehouse will increase 290 percent to more than 1.2 terabytes in 2002. E-business is one of the primary culprits in the data explosion, as clickstream data is expected to quickly add terabytes to the data warehouse. As the number of other customer contact channels grows, they add even more data. Escalating end-user demands also play a part, as organizations collect more information and store it for longer periods.
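The figures above can be turned into rough rates with a short calculation. This sketch assumes that "30 times more" means a 30-fold volume over the five years from 1999 to 2004, and that an "increase of 290 percent" means the final size is 3.9 times the starting size; both readings are interpretations, not statements from the surveys.

```python
def annual_growth_rate(multiple, years):
    """Compound annual growth rate implied by an overall multiple."""
    return multiple ** (1 / years) - 1


def baseline_before_increase(final_size, percent_increase):
    """Starting size implied by a final size and a percentage increase."""
    return final_size / (1 + percent_increase / 100)


# 30-fold growth over five years is roughly a doubling every year.
rate = annual_growth_rate(30, 5)

# A 290 percent increase ending at 1.2 terabytes implies a start of about 0.3 terabytes.
baseline_tb = baseline_before_increase(1.2, 290)
```

Under these assumptions the implied growth is a little under 100 percent per year, which is why planning for capacity by adding hardware incrementally matters.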

The data explosion creates extreme scalability challenges for enterprises. A truly scalable data warehouse will allow an enterprise to accommodate increasing data volumes by simply adding more hardware. Scalable data warehouses typically rely on parallel technology frameworks. Fortunately, lower hardware costs are making parallel technology more accessible. Distributed memory parallel processor (DMPP) hardware is becoming less expensive, and alternatives to DMPP are also improving – server clustering (of SMPs) is evolving as a substitute. Additionally, storage costs continue to decline every year, making it possible for enterprises to keep terabytes of detailed historical data.

This article has looked at some of the major challenges in data warehousing today. Hopefully this list provides some food for thought for those involved in data warehousing initiatives and encourages you to examine the way these trends and issues affect your own organizations. While this article has presented only brief suggestions for dealing with data warehousing challenges, I hope readers will use these suggestions as a springboard to further exploration of available solutions.
