Certified Data Mining and Warehousing Professional: Components and Metadata Role

The architecture of a typical data mining system may have the following major components:

  Database, data warehouse, World Wide Web, or other information repository:

  1. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
  2. Data cleaning and data integration techniques may be performed on the data.

  Database or data warehouse server:

 

  1. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

  Knowledge base:

 

  1. This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
  2. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction (a small sketch follows this list).
  3. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
  4. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
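
As a hedged illustration, a concept hierarchy can be represented as a simple mapping from a value to its more abstract parent. The location levels and values below are assumptions for the example, not taken from any particular system:

    # Minimal sketch: a concept hierarchy organizing attribute values
    # into levels of abstraction (city < state < country).
    # All values here are hypothetical examples.
    LOCATION_HIERARCHY = {
        "Mumbai": "Maharashtra",
        "Pune": "Maharashtra",
        "Maharashtra": "India",
    }

    def roll_up(value):
        """Climb one level of abstraction, if a higher level exists."""
        return LOCATION_HIERARCHY.get(value)

    print(roll_up("Pune"))         # -> Maharashtra
    print(roll_up("Maharashtra"))  # -> India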

  Data mining engine:

 

  1. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

  Pattern evaluation module:

 

  1. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
  2. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.
  3. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns (a small filtering sketch follows this list).
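
As a hedged illustration of threshold-based filtering, the sketch below keeps only association rules whose support and confidence meet minimum thresholds; the rule records and threshold values are assumptions for the example:

    # Minimal sketch: filtering discovered patterns (association rules)
    # by interestingness thresholds. Rules and thresholds are hypothetical.
    def filter_interesting(rules, min_support=0.1, min_confidence=0.6):
        """Keep only rules that meet both interestingness thresholds."""
        return [r for r in rules
                if r["support"] >= min_support
                and r["confidence"] >= min_confidence]

    discovered = [
        {"rule": "bread -> butter", "support": 0.25, "confidence": 0.80},
        {"rule": "bread -> soap", "support": 0.02, "confidence": 0.30},
    ]

    for r in filter_interesting(discovered):
        print(r["rule"])  # only the first rule passes both thresholds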

  User interface:

 

  1. This module mediates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate mining results.
  2. Also, it allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

 


DATA WAREHOUSE COMPONENTS & ARCHITECTURE

The data in a data warehouse comes from the operational systems of the organization as well as from other external sources. These are collectively referred to as source systems. The data extracted from source systems is stored in an area called the data staging area, where it is cleaned, transformed, combined, and deduplicated to prepare it for use in the data warehouse. The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide any query or presentation services. As soon as a system provides query or presentation services, it is categorized as a presentation server. A presentation server is the target machine on which the data loaded from the data staging area is organized and stored for direct querying by end users, report writers, and other applications. The three different kinds of systems required for a data warehouse are:

  1. Source Systems
  2. Data Staging Area
  3. Presentation servers

 

The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle’s ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server’s ETL tool is called Data Transformation Services (DTS).
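
As a hedged, end-to-end illustration of that flow, the sketch below moves one record from a source system, through a staging transformation, to a presentation store; the record, field names, and in-memory lists are all assumptions standing in for real systems and tools such as OWB or DTS:

    # Minimal sketch: data travelling from a source system, through a
    # staging step, to a presentation store. All data is hypothetical.
    source_system = [{"CUST": " asha ", "AMT": "10.50"}]
    presentation_server = []

    def extract(source):
        return list(source)  # pull the relevant rows from the source

    def transform(rows):
        return [{"customer": r["CUST"].strip().title(),
                 "amount": float(r["AMT"])} for r in rows]

    def load(rows, target):
        target.extend(rows)  # write into the presentation store

    load(transform(extract(source_system)), presentation_server)
    print(presentation_server)  # [{'customer': 'Asha', 'amount': 10.5}]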
A typical architecture of a data warehouse is shown below:

[Figure: typical data warehouse architecture]

Each component and the tasks it performs are explained below:
           

  • OPERATIONAL DATA

Data for the data warehouse is supplied from the following sources:

    • Mainframe systems holding data in traditional network and hierarchical formats.
    • Relational DBMSs such as Oracle and Informix.
    • In addition to these internal data, operational data also includes external data obtained from commercial databases and databases associated with suppliers and customers.
  • LOAD MANAGER

The load manager performs all the operations associated with extracting data and loading it into the data warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse. The size and complexity of this component vary between data warehouses, and it may be constructed using a combination of vendor data loading tools and custom-built programs.

  • WAREHOUSE MANAGER

The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include:

    • Analysis of data to ensure consistency
    • Transformation and merging of source data from temporary storage into data warehouse tables
    • Creation of indexes and views on the base tables
    • Denormalization
    • Generation of aggregations
    • Backing up and archiving of data

In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate (a small sketch of aggregate generation follows).
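
As a hedged illustration of aggregate generation, the sketch below builds a lightly summarized table from a detailed one using Python's built-in sqlite3 module as a stand-in for the warehouse DBMS; the table and column names are hypothetical:

    import sqlite3

    # Minimal sketch: generate a monthly aggregate from detailed sales.
    # Table and column names are illustrative assumptions.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_detail (product TEXT, sale_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)",
                     [("soap", "2024-01-05", 10.0),
                      ("soap", "2024-01-20", 15.0)])

    # The warehouse manager would run a statement like this on a schedule.
    conn.execute("""
        CREATE TABLE sales_monthly AS
        SELECT product, substr(sale_date, 1, 7) AS month, SUM(amount) AS total
        FROM sales_detail
        GROUP BY product, month
    """)
    print(conn.execute("SELECT * FROM sales_monthly").fetchall())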

  • QUERY MANAGER

The query manager performs all operations associated with the management of user queries. This component is usually constructed using vendor end-user access tools, data warehouse monitoring tools, database facilities, and custom-built programs. The complexity of the query manager is determined by the facilities provided by the end-user access tools and the database.

  • DETAILED DATA

This area of the warehouse stores all the detailed data in the database schema. In most cases the detailed data is not stored online but is aggregated to the next level of detail. However, detailed data is added regularly to the warehouse to supplement the aggregated data.

  • LIGHTLY AND HIGHLY SUMMARIZED DATA

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient, as it changes on an ongoing basis in order to respond to changing query profiles. The purpose of the summarized information is to speed up query performance. The summarized data is updated continuously as new data is loaded into the warehouse.

  • ARCHIVE AND BACK UP DATA

This area of the warehouse stores detailed and summarized data for the purposes of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.

  • METADATA

The data warehouse also stores all the metadata (data about data) definitions used by all processes in the warehouse. It is used for a variety of purposes, including:

    • The extraction and loading process: metadata is used to map data sources to a common view of the information within the warehouse.
    • The warehouse management process: metadata is used to automate the production of summary tables.
    • The query management process: metadata is used to direct a query to the most appropriate data source.

The structure of the metadata differs for each process because the purpose is different; a small sketch of such a source-to-warehouse mapping follows. More about metadata will be discussed in later lecture notes.
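
As a hedged illustration of extraction metadata, the sketch below maps differently named source fields onto one common warehouse view; every source, field, and record name is a hypothetical example:

    # Minimal sketch: metadata mapping heterogeneous source field names
    # onto common warehouse attributes. All names are hypothetical.
    FIELD_MAP = {
        "payroll_db": {"EMP_NAME": "employee_name", "SAL": "salary"},
        "hr_db": {"ENAME": "employee_name", "SALARY": "salary"},
    }

    def to_warehouse_view(source, record):
        """Rename a source record's fields using the metadata map."""
        mapping = FIELD_MAP[source]
        return {mapping.get(field, field): value
                for field, value in record.items()}

    print(to_warehouse_view("hr_db", {"ENAME": "Asha", "SALARY": 50000}))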

  • END-USER ACCESS TOOLS

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include:

    • Reporting and Query Tools
    • Application Development Tools
    • Executive Information Systems Tools
    • Online Analytical Processing Tools
    • Data Mining Tools

THE ETL (EXTRACT, TRANSFORM, LOAD) PROCESS
In this section we discuss the four major processes of the data warehouse: extract (pull data from the operational systems and bring it to the data warehouse), transform (convert the data into the internal format and structure of the data warehouse), cleanse (make sure it is of sufficient quality to be used for decision making), and load (put the cleansed data into the data warehouse).
The four processes from extraction through loading are often referred to collectively as data staging.

EXTRACT

Some of the data elements in the operational database can reasonably be expected to be useful for decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following (a sketch of such a specification follows the list):

    • Which files and tables are to be accessed in the source database?
    • Which fields are to be extracted from them? This is often done internally by an SQL SELECT statement.
    • What are those fields to be called in the resulting database?
    • What are the target machine and database format of the output?
    • On what schedule should the extraction process be repeated?
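
A minimal sketch of what such an extraction specification might look like, expressed as a plain data structure; the source table, fields, target, and schedule are all assumptions for illustration:

    # Minimal sketch: a declarative extraction specification answering
    # the questions above. Every name and value is hypothetical.
    EXTRACT_SPEC = {
        "source_table": "ORDERS",                  # files/tables to access
        "fields": ["ORDER_ID", "CUST_ID", "AMT"],  # fields to extract
        "renames": {"AMT": "order_amount"},        # names in the target
        "target": {"machine": "dw-host", "format": "oracle"},
        "schedule": "daily at 02:00",              # repetition schedule
    }

    def build_select(spec):
        """Build the SELECT statement an extraction tool might run."""
        return "SELECT {} FROM {}".format(", ".join(spec["fields"]),
                                          spec["source_table"])

    print(build_select(EXTRACT_SPEC))  # SELECT ORDER_ID, CUST_ID, AMT FROM ORDERS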

 

TRANSFORM
Operational databases can be developed under any set of priorities, and those priorities keep changing with the requirements. Those who develop a data warehouse based on these databases are therefore typically faced with inconsistency among their data sources. The transformation process deals with rectifying any such inconsistencies.
One of the most common transformation issues is attribute naming inconsistency: it is common for a given data element to be referred to by different names in different databases. Employee Name may be EMP_NAME in one database and ENAME in another. Thus one set of data names is picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following:

  • Characters must be converted from ASCII to EBCDIC or vice versa.
  • Mixed-case text may be converted to all uppercase for consistency.
  • Numerical data must be converted into a common format.
  • Data formats have to be standardized.
  • Measurements may have to be converted (e.g., Rs to $).
  • Coded data (Male/Female vs. M/F) must be converted into a common format.

All these transformation activities can be automated, and many commercial products are available to perform them; DataMAPPER from Applied Database Technologies is one such comprehensive tool. A small sketch of two such conversions follows.
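
As a hedged sketch of two of the conversions listed above (uppercasing mixed text and normalizing coded data), with field names and code mappings invented for the example:

    # Minimal sketch: two common format conversions from the list above.
    # Field names and code mappings are illustrative assumptions.
    GENDER_CODES = {"MALE": "M", "FEMALE": "F", "M": "M", "F": "F"}

    def transform_record(record):
        """Uppercase mixed text and normalize coded gender values."""
        out = dict(record)
        out["employee_name"] = out["employee_name"].upper()
        out["gender"] = GENDER_CODES[out["gender"].upper()]
        return out

    print(transform_record({"employee_name": "Asha Rao", "gender": "Female"}))
    # {'employee_name': 'ASHA RAO', 'gender': 'F'}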

CLEANSING

Information quality is the key consideration in determining the value of the information. The developer of the data warehouse is not usually in a position to change the quality of the underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is therefore usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing.
Data cleansing must deal with many types of possible errors. These include missing data and incorrect data at one source, and inconsistent data and conflicting data when two or more sources are involved. There are several algorithms for cleaning the data, which will be discussed in the coming lecture notes; a small sketch follows.
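
A minimal cleansing sketch handling one missing value and one obviously incorrect value; the validity rule and the decision to flag rather than drop the record are assumptions for illustration:

    # Minimal sketch: cleansing missing and incorrect data at one source.
    # The age-range rule and the flagging policy are assumptions.
    def cleanse(record):
        """Null out impossible ages and keep an audit flag for review."""
        cleaned = dict(record)
        age = cleaned.get("age")
        if age is None or not (0 <= age <= 120):
            cleaned["age"] = None          # value unusable as-is
            cleaned["age_suspect"] = True  # keep an audit trail
        return cleaned

    print(cleanse({"name": "Ravi", "age": 999}))  # incorrect data
    print(cleanse({"name": "Mina"}))              # missing data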

LOADING

Loading often implies the physical movement of the data from the computer(s) storing the source database(s) to the one that will store the data warehouse database, assuming the two are different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is Oracle's tool that provides the features to perform the ETL task on an Oracle data warehouse. A minimal loading sketch follows.
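
The sketch below loads cleansed records into a warehouse table, again using sqlite3 as a stand-in for the warehouse database; the table name and records are assumptions:

    import sqlite3

    # Minimal sketch: loading cleansed records into a warehouse table.
    # sqlite3 stands in for the warehouse DBMS; names are hypothetical.
    records = [("Asha", 34), ("Ravi", 41)]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee_dim (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO employee_dim VALUES (?, ?)", records)
    conn.commit()
    print(conn.execute("SELECT COUNT(*) FROM employee_dim").fetchone())  # (2,)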


METADATA

The term metadata is ambiguous, as it is used for two fundamentally different concepts (types). Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at design time the application contains no data. In this case the correct description would be "data about the containers of data". Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description (resulting in a disambiguating neologism) would be "data about data content" or "content about content", thus metacontent. Descriptive metadata, guide metadata, and the National Information Standards Organization's concept of administrative metadata are all subtypes of metacontent.

Metadata (metacontent) is traditionally found in the card catalogs of libraries. As information has become increasingly digital, metadata is also used to describe digital data using metadata standards specific to a particular discipline. By describing the contents and context of data files, the quality of the original data/files is greatly increased. For example, a webpage may include metadata specifying what language it is written in, what tools were used to create it, and where to go for more on the subject, allowing browsers to automatically improve the experience of users.

Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:

  • Means of creation of the data
  • Purpose of the data
  • Time and date of creation
  • Creator or author of data
  • Location on a computer network where the data was created
  • Standards used

Metadata and data warehousing

A data warehouse (DW) is a repository of an organization's electronically stored data. Data warehouses are designed to manage and store the data, whereas Business Intelligence (BI) focuses on the usage of data to facilitate reporting and analysis.

The purpose of a data warehouse is to house standardized, structured, consistent, integrated, correct, cleansed, and timely data, extracted from various operational systems in an organization. The extracted data is integrated in the data warehouse environment to provide an enterprise-wide perspective, one version of the truth. Data is structured specifically to address the reporting and analytic requirements. The design of structural metadata commonality using a data modeling method such as entity-relationship diagramming is very important in any data warehouse development effort.

An essential component of a data warehouse/business intelligence system is the metadata, together with tools to manage and retrieve it. Ralph Kimball describes metadata as the DNA of the data warehouse, because metadata defines the elements of the data warehouse and how they work together.

Kimball et al. refer to three main categories of metadata: technical metadata, business metadata, and process metadata. Technical metadata is primarily definitional, while business metadata and process metadata are primarily descriptive. Keep in mind that the categories sometimes overlap.

  • Technical metadata defines the objects and processes in a DW/BI system, as seen from a technical point of view. Technical metadata includes the system metadata that defines data structures such as tables, fields, data types, indexes, and partitions in the relational engine, and databases, dimensions, measures, and data mining models. Technical metadata also defines the data model and the way it is displayed for the users, along with the reports, schedules, distribution lists, and user security rights.
  • Business metadata is content from the data warehouse described in more user-friendly terms. Business metadata tells you what data you have, where it comes from, what it means, and what its relationship is to other data in the data warehouse. Business metadata may also serve as documentation for the DW/BI system. Users who browse the data warehouse are primarily viewing the business metadata.
  • Process metadata describes the results of various operations in the data warehouse. Within the ETL process, all key data from tasks is logged on execution: start time, end time, CPU seconds used, disk reads, disk writes, and rows processed. When troubleshooting the ETL or query process, this sort of data becomes valuable. Process metadata plays the role of fact measurement when building and using a DW/BI system. Some organizations make a living out of collecting and selling this sort of data to companies; in that case the process metadata becomes the business metadata for the fact and dimension tables. Process metadata is of interest to business people, who can use the data to identify the users of their products, which products they are using, and what level of service they are receiving. (A small logging sketch follows this list.)
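
As a hedged sketch of process-metadata capture, the function below wraps an ETL task and records start time, end time, elapsed seconds, and rows processed; the task and its row count are invented, and a real system would write the log to a table rather than print it:

    import time

    # Minimal sketch: logging process metadata around an ETL task.
    # The task body and the metric set are illustrative assumptions.
    def run_with_process_metadata(task_name, task):
        start = time.time()
        rows = task()  # the task returns the number of rows it processed
        end = time.time()
        log = {"task": task_name,
               "start": start,
               "end": end,
               "elapsed_sec": round(end - start, 3),
               "rows_processed": rows}
        print(log)  # in practice: insert into a process-metadata table
        return log

    run_with_process_metadata("load_employee_dim", lambda: 1250)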

 
