Certified Data Mining and Warehousing Professional Data and architecture design

Data and architecture design

Data architecture in Information Technology is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations. A Data Architecture is often the design of data for use in defining the target state and the subsequent planning needed to achieve the target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.

A data architecture should set data standards for all its data systems as a vision or a model of the eventual interactions between those data systems. Data integration, for example, should be dependent upon data architecture standards since data integration requires data interactions between two or more data systems. A data architecture, in part, describes the data structures used by a business and its computer applications software. Data architectures address data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those data artifacts to data qualities, applications, locations etc.

Essential to realizing the target state, Data Architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system.

The Data Architect is typically responsible for defining the target state, aligning during development and then following up to ensure enhancements are done in the spirit of the original blueprint.

During the definition of the target state, the Data Architecture breaks a subject down to the atomic level and then builds it back up to the desired form. The Data Architect breaks the subject down by going through 3 traditional architectural processes:

  • Conceptual - represents all business entities.
  • Logical - represents the logic of how entities are related.
  • Physical - the realization of the data mechanisms for a specific type of functionality.

The "data" column of the Zachman Framework for enterprise architecture –

Layer View Data (What) Stakeholder
1 Scope/Contextual List of things and architectural standards important to the business Planner
2 Business Model/Conceptual Semantic model or Conceptual/Enterprise Data Model Owner
3 System Model/Logical Enterprise/Logical Data Model Designer
4 Technology Model/Physical Physical Data Model Builder
5 Detailed Representations Actual databases Subcontractor

In this second, broader sense, data architecture includes a complete analysis of the relationships between an organization's functions, available technologies, and data types.

Data architecture should be defined in the planning phase of the design of a new data processing and storage system. The major types and sources of data necessary to support an enterprise should be identified in a manner that is complete, consistent, and understandable. The primary requirement at this stage is to define all of the relevant data entities, not to specify computer hardware items. A data entity is any real or abstracted thing about which an organization or individual wishes to store data.

Data Design

Designing data is about discovering and completely defining your application's data characteristics and processes. Data design is a process of gradual refinement, from the coarse "What data does your application require?" to the precise data structures and processes that provide it. With a good data design, your application's data access is fast, easily maintained, and can gracefully accept future data enhancements.

The process of data design includes identifying the data, defining specific data types and storage mechanisms, and ensuring data integrity by using business rules and other run-time enforcement mechanisms.

Data describes a real-world information resource that is important to your application. Data describes the things, people, products, items, customers, assets, records, and — ultimately — data structures that your application finds useful to categorize, organize, and maintain.

Identifying data is an iterative process. At first you may just know some vague, high-level details about how the application must handle its information. As you keep expanding your knowledge of the application's intended business processes, you continue filling in more details.

As you begin documenting the data requirements for your application, the description for each item of data typically includes:

  • Name.
  • General description (what is it?).
  • Ownership (who is responsible for it?).
  • Data characteristics (how is it measured and how big or small can it be?).
  • Logical events, processes, and relationships (how and when is it created, modified, and used?).

It is worth noting that data has many different characteristics. Part of the process of data design is to specify how to quantify each data item. Some typical data characteristics are:

  • Location attributes (address, country, warehouse bin).
  • Physical attributes (weight, dimension, volume, color, material, texture).
  • Conceptual attributes (name, rank, serial number).
  • Relational attributes (assemblies consist of sub-assemblies, authors write multiple books).
  • Value attributes (currency, goodwill, esteem).

The process of identifying data requires interviews, analysis of existing data structures, document preparation, and peer reviews. The eventual result is a documented, conceptual view of your application's information that answers the data questions of "What, where, when, and why?" Generally, this is an early-stage exploration of how the various departments, organizations, and your application need to use data.


The general approach to defining data includes the following:

  • Defining the tables, rows, and columns.
  • Inserting index keys.
  • Creating table relationships.
  • Assigning data types.

Defining the Tables, Rows, and Columns

Regardless of how your application's data is physically stored, the data is typically organized into multiple tables (or files), each having a set of rows (or records) and columns (or fields), similar to the rows and columns of a spreadsheet. Each row in the table contains all of the information about a particular thing, person, product, item, customer, or asset.

For example, the following Authors table stores the name, address, and phone number for several authors.

Row Name Address Phone
1 Arnie Baldwin xxxxx xxxxxxxx (xxx) xxx-xxxx
2 Don Hall xxxxx xxxxxxxx (xxx) xxx-xxxx
3 Teresa Atkinson xxxxx xxxxxxxx (xxx) xxx-xxxx
4 David Simpson xxxxx xxxxxxxx (xxx) xxx-xxxx

Inserting Index Keys

A key is a special field that provides an index for fast retrieval. A key can be unique or non-unique, depending on whether duplicates are allowed. A key can be designated as the primary key, making it the unique identifier for each row of the table. You should use keys where your application needs direct access to specific rows.

For example, in the following table the author's identification number (au_id) is added as the table's primary key, so that au_id uniquely identifies one and only one author. A query using a value for au_id will provide very fast retrieval of that author's information.

Authors table
au_id (key)

Creating Table Relationships

A database is usually composed of more than one table, and the tables are often related to one another in various ways, often by using foreign keys. For instance, a Titles table might list the International Standard Book Number (ISBN), title, and year the book was published. It would also be useful to identify the publisher for each title. Rather than repeating all of the publisher information for each title in the Titles table, you could create a Publishers table and establish a relationship with the Titles table by including the publisher's identification (pu_id) in the Titles table as a foreign key.

In the following example, the Publishers table is related to the Titles table:

Titles table Publishers table
ti_isbn (key) pu_id (key)
ti_title pu_name
ti_yearpublished pu_address
pu_id (foreign key) pu_phone

This is called a one-to-many relationship because a single row in the Publishers table can be related to one or more rows in the Titles table (because a publisher will handle many titles). You can also create relationships that are one-to-one and many-to-many.

It is worth noting that you have only identified that a relationship exists between the Publishers and Titles tables — you have made no commitment to how that relationship will be managed. Depending on your final implementation, you might use the table joins and foreign key constraints available with SQL Server, or you might write custom code to read the file structures directly and handle referential integrity inside your application.

Assigning Data Types

A data type is a named category of data characterized by a set of values, a way to denote the values, and some implied operations that can interpret and manipulate the values. Data types can be either intrinsic or derived.

An intrinsic data type is one that is provided by your database. For example, SQL Server provides intrinsic data types such as integer, datetime, bit, char, and varchar.

Derived data types are defined using the Data Modeling Language (DML) provided by your database. A derived data type is built from the available intrinsic data types or previously defined derived data types. You typically provide a name and a structure for the derived data type. With derived data types, you can assure consistent use of special data types for selected columns, variables, and parameters.

Data types are important because they assure that the assigned data value is of the correct type and within the acceptable range of values. Different data storage technologies and development languages support a variety of data types, including the following.

  • Boolean.
  • Integer.
  • Float.
  • Datetime.
  • Tinyint.
  • COMP-3.
  • Binary.
  • String.
  • Character.

While assigning data types, you want to be sure that the range provided by the data type fits the data which will be stored and — hopefully — anticipates future changes. For example, if you choose tinyint as a data type for customer identification, your application can handle a maximum of only 255 customers. On the other hand, if you choose the integer data type, you can have over two billion customers. As another example, if you use a single character to indicate customer service code, a later expansion to two characters will require difficult changes throughout your application.

You can save space in your database and improve join operations by choosing appropriate data types for fields. As a general rule, you should choose the smallest data type that is appropriate for the data in the field.

While you are assigning data types, some things to think about are:

  • Allowable maximums and minimums.
  • Default values.
  • Empty (or NULL) values.
  • Anticipated growth.
  • Expected and (to the extent possible) unexpected changes.

In a relational database environment, data types help enforce the business rules. For example, you cannot add dollars and colors and get a useful answer. While you would never program such an addition process intentionally, a relational database will identify the data type mismatch and automatically deny your query.

Data Warehousing Architecture
Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture, rather multiple architectures exist to support various environments and situations. The worthiness of the architecture can be judged in how the conceptualization aids in the building, maintenance, and usage of the data warehouse.

One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:
Operational database layer - The source data for the data warehouse - An organization's Enterprise Resource Planning systems fall into this layer.
Data access layer -   The interface between the operational and informational access layer - Tools to extract, transform, load data into the warehouse fall into this layer.
Metadata layer -    The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
Informational access layer -   The data accessed for reporting and analyzing and the tools for reporting and analyzing data -

A data warehouse constructed from integrated data source systems does not require ETL, staging databases, or operational data store databases. It identifies and describes each architectural component.  The model is useful in understanding key Data Warehousing concepts, terminology, problems and opportunities.

 For Support