Metadata for Data Warehousing

The term metadata is ambiguous, as it is used for two fundamentally different concepts (types). Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at design time the application contains no data. In this case the correct description would be "data about the containers of data". Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description (resulting in a disambiguating neologism) would be "data about data content" or "content about content" thus metacontent. Descriptive, Guide and the National Information Standards Organization concept of administrative metadata are all subtypes of metacontent.

Metadata (metacontent) is traditionally found in the card catalogs of libraries. As information has become increasingly digital, metadata is also used to describe digital data using metadata standards specific to a particular discipline. By describing the contents and context of data files, the quality of the original data/files is greatly increased. For example, a webpage may include metadata specifying what language it is written in, what tools were used to create it, and where to go for more on the subject, allowing browsers to automatically improve the experience of users.

Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:

Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of data
Location on a computer network where the data was created
Standards used

For example, a digital image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.

Metadata is data. As such, metadata can be stored and managed in a database, often called a Metadata registry or Metadata repository. However, without context and a point of reference, it can be impossible to identify metadata just by looking at it. For example: by itself, a database containing several numbers, all 13 digits long could be the results of calculations or a list of numbers to plug into an equation - without any other context, the numbers themselves can be perceived as the data. But if given the context that this database is a log of a book collection, those 13-digit numbers may now be ISBNs - information that refers to the book, but is not itself the information within the book.

The term "metadata" was coined in 1968 by Philip Bagley, in his book "Extension of programming language concepts" where it is clear that he uses the term in the ISO 11179 "traditional" sense, which is "structural metadata" i.e. "data about the containers of data"; rather than the alternate sense "content about individual instances of data content" or metacontent, the type of data usually found in library catalogues. Since then the fields of information management, information science, information technology, librarianship and GIS have widely adopted the term. In these fields the word metadata is defined as "data about data". While this is the generally accepted definition, various disciplines have adopted their own more specific explanation and uses of the term.

Libraries

Metadata has been used in various forms as a means of cataloging archived information. The Dewey Decimal System employed by libraries for the classification of library materials is an early example of metadata usage. Library catalogues used 3x5 inch cards to display a book's title, author, subject matter, and a brief plot synopsis along with an abbreviated alpha-numeric identification system which indicated the physical location of the book within the library's shelves. Such data helps classify, aggregate, identify, and locate a particular book. Another form of older metadata collection is the use by US Census Bureau of what is known as the "Long Form." The Long Form asks questions that are used to create demographic data to create patterns and to find patterns of distribution. For the purposes of this article, an "object" refers to any of the following:

A physical item such as a book, CD, DVD, map, chair, table, flower pot, etc.
An electronic file such as a digital image, digital photo, document, program file, database table, etc.

Photographs

Metadata may be written into a digital photo file that will identify who owns it, copyright & contact information, what camera created the file, along with exposure information and descriptive information such as keywords about the photo, making the file searchable on the computer and/or the Internet. Some metadata is written by the camera and some is input by the photographer and/or software after downloading to a computer. However, not all digital cameras enable you to edit metadata; this functionality has been available on most Nikon DSLRs since the Nikon D3 and on most new Canon cameras since the Canon EOS 7D.

Photographic Metadata Standards are governed by organizations that develop the following standards. They include, but are not limited to:

IPTC Information Interchange Model IIM (International Press Telecommunications Council),
IPTC Core Schema for XMP
XMP – Extensible Metadata Platform (an ISO standard)
Exif – Exchangeable image file format, Maintained by CIPA (Camera & Imaging Products Association) and published by JEITA (Japan Electronics and Information Technology Industries Association)
Dublin Core (Dublin Core Metadata Initiative – DCMI)
PLUS (Picture Licensing Universal System).

Video

Metadata is particularly useful in video, where information about its contents (such as transcripts of conversations and text descriptions of its scenes) are not directly understandable by a computer, but where efficient search is desirable.

Web pages

Web pages often include metadata in the form of meta tags. Description and keywords meta tags are commonly used to describe the Web page's content. Most search engines use this data when adding pages to their search index.

Creation of metadata

Metadata can be created either by automated information processing or by manual work. Elementary metadata captured by computers can include information about when a file was created, who created it, when it was last updated, file size and file extension.

Metadata types

The metadata application is manyfold covering a large variety of fields of application there are nothing but specialised and well accepted models to specify types of metadata. Bretheron & Singley (1994) distinguish between two distinct classes: structural/control metadata and guide metadata. Structural metadata is used to describe the structure of computer systems such as tables, columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language. According to Ralph Kimball metadata can be divided into 2 similar categories—Technical metadata and Business metadata. Technical metadata correspond to internal metadata, business metadata to external metadata. Kimball adds a third category named Process metadata. On the other hand, NISO distinguishes between three types of metadata: descriptive, structural and administrative. Descriptive metadata is the information used to search and locate an object such as title, author, subjects, keywords, publisher; structural metadata gives a description of how the components of the object are organised; and administrative metadata refers to the technical information including file type. Two sub-types of administrative metadata are rights management metadata and preservation metadata.

Metadata is one of the important keys to the success of the data warehousing and business intelligence effort. Metadata management answers these questions:

What is Metadata?
How can Metadata be Managed?
Extracting Metadata from Legacy Systems

What is Metadata?

Metadata is your control panel to the data warehouse. It is data that describes the data warehousing and business intelligence system:

Reports
Cubes
Tables (Records, Segments, Entities, etc.)
Columns (Fields, Attributes, Data Elements, etc.)
Keys
Indexes

Metadata is often used to control the handling of data and describes:

Rules
Transformations
Aggregations
Mappings

The power of metadata is that enables data warehousing personnel to develop and control the system without writing code in languages such as: Java, C# or Visual Basic. This saves time and money both in the initial set up and on going management.

Data Warehouse Metadata

Data warehousing has specific metadata requirements. Metadata that describes tables typically includes:

Physical Name
Logical Name
Type: Fact, Dimension, Bridge
Role: Legacy, OLTP, Stage,
DBMS: DB2, Informix, MS SQL Server, Oracle, Sybase
Location
Definition
Notes

Metadata describes columns within tables:

Physical Name
Logical Name
Order in Table
Datatype
Length
Decimal Positions
Nullable/Required
Default Value
Edit Rules
Definition
Notes

How can Data Warehousing Metadata be Managed?

Data warehousing and business intelligence metadata is best managed through a combination of people, process and tools.

The people side requires that people be trained in the importance and use of metadata. They need to understand how and when to use tools as well as the benefits to be gained through metadata.

The process side incorporates metadata management into the data warehousing and business intelligence life cycle. As the life cycle progresses metadata is entered into the appropriate tool and stored in a metadata repository for further use.

Metadata can be managed through individual tools:

Metadata manager / repository
Metadata extract tools
Data modeling
ETL
BI Reporting

Metadata Manager / Repository

Metadata can be managed through a shared repository that combines information from multiple sources.

The metadata manager can be purchased as a software package or built as "home grown" system. Many organizations start with a spreadsheet containing data definitions and then grow to a more sophisticated approach.

Extracting Metadata from Input Sources

Metadata can be obtained through a manual process of keying in metadata or through automated processes. Scanners can extract metadata from text such as SQL DDL or COBOL programs. Other tools can directly access metadata through SQL catalogs and other metadata sources.

Picking the appropriate metadata extract tools is a key part of metadata management.

Many data modeling tools include a metadata extract capability - otherwise known as "reverse engineering". Through this tool, database information about tables and columns can be extracted. The information can then be exported from the data modeling tool to the metadata manager.

It includes the following topics -

Certified Data Mining and Warehousing Professional Metadata for Data Warehousing