Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
- operational or transactional data, such as sales, cost, inventory, payroll, and accounting
- nonoperational data, such as industry sales, forecast data, and macroeconomic data
- metadata: data about the data itself, such as logical database design or data dictionary definitions
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies. Equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.
A Data Mining Glossary
Accuracy. A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data.
Application Programming Interface (API). The formally defined programming language interface between a program (system control program, licensed program) and its user.
Artificial Intelligence. The scientific field concerned with the creation of intelligent behavior in a machine.
Artificial Neural Network (ANN). See Neural Network.
Association Rule. A rule in the form of “if this then that” that associates events in a database. For example, a rule might associate items that are purchased together at a supermarket.
Back Propagation. One of the most common learning algorithms for training neural networks.
Binning. The process of breaking up continuous values into bins. Usually done as a preprocessing step for some data mining algorithms. For example, age can be broken up into one bin for every ten years.
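The ten-year age example can be sketched in a few lines of Python. This is only an illustration: the function name `bin_age`, the label format, and the sample ages are assumptions, not part of the glossary.

```python
def bin_age(age, width=10):
    """Return a label such as '30-39' for the bin that contains `age`."""
    lower = (age // width) * width  # lower edge of the bin
    return f"{lower}-{lower + width - 1}"

# Hypothetical ages, used only to show the mapping.
ages = [23, 37, 41, 45, 68]
binned = [bin_age(a) for a in ages]
print(binned)
```

Equal-width bins are the simplest choice; other schemes, such as bins holding equal numbers of records, follow the same pattern with a different rule for the bin edges.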
Brute Force Algorithm. A computer technique that exhaustively repeats very simple steps in order to find an optimal solution. Such algorithms stand in contrast to complex techniques that are less wasteful in moving toward an optimal solution but are harder to construct and are more computationally expensive to execute.
Cardinality. The number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes), low cardinality fields have few different values (e.g. eye color).
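As a small illustration of the definition, cardinality can be measured by counting the distinct values in a field. The records and field names below are hypothetical.

```python
def cardinality(records, field):
    """Count the distinct values that `field` takes across `records`."""
    return len({record[field] for record in records})

# Hypothetical customer records.
customers = [
    {"zip_code": "02139", "eye_color": "brown"},
    {"zip_code": "94103", "eye_color": "blue"},
    {"zip_code": "60614", "eye_color": "brown"},
    {"zip_code": "10001", "eye_color": "blue"},
]
print(cardinality(customers, "zip_code"), cardinality(customers, "eye_color"))
```

In this tiny sample every zip code is different while eye color takes only two values, which mirrors the high and low cardinality cases described above.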
CART. Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID. Chi-Square Automatic Interaction Detector. A decision tree that uses contingency tables and the chi-square test to create the tree.
Classification. The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm. Classification is the process of determining that a record belongs to a group.
Clustering. The process of grouping similar records or input patterns together using an unsupervised training algorithm, based on their locality and connectivity within the n-dimensional space.
Collinearity. The property of two predictors showing significant correlation without a causal relationship between them.
Conditional Probability. The probability of an event happening given that some other event has already occurred. For example, the chance of a person committing fraud is much greater given that the person had previously committed fraud.
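A conditional probability can be estimated from records by counting only the cases in which the conditioning event occurred. The sketch below uses made-up fraud records; the function name and the data are assumptions chosen for illustration.

```python
def conditional_probability(records, event, given):
    """Estimate P(event | given) as count(event and given) / count(given)."""
    given_records = [r for r in records if given(r)]
    if not given_records:
        raise ValueError("conditioning event never occurs in the data")
    both = [r for r in given_records if event(r)]
    return len(both) / len(given_records)

# Hypothetical records: (had_prior_fraud, committed_fraud)
records = [
    (True, True), (True, False), (False, False), (False, False),
    (True, True), (False, True), (False, False), (False, False),
]

# Overall fraud rate, ignoring history.
p_fraud = sum(1 for r in records if r[1]) / len(records)
# Fraud rate among people with a prior fraud.
p_fraud_given_prior = conditional_probability(
    records, event=lambda r: r[1], given=lambda r: r[0]
)
print(p_fraud, p_fraud_given_prior)
```

In this invented sample the fraud rate among people with a prior fraud is higher than the overall rate, which is the pattern the glossary example describes.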
Coverage. A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied.
CRM. See Customer Relationship Management.
Cross Validation (and Test Set Validation). The process of holding aside some of the training data, which is not used to build the predictive model, and later using that data to estimate the accuracy of the model on unseen data, simulating real-world deployment of the model.
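One common way to hold data aside is k-fold cross validation, in which each record is held out exactly once. The glossary does not prescribe a particular scheme, so the following is a sketch under that assumption; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                     # held-aside records
        train = indices[:start] + indices[start + size:]       # records used to build the model
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

In practice the records are usually shuffled before splitting so that each fold is representative of the whole data set.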
Customer Relationship Management. The process by which companies manage their interactions with customers.
Data mining. The process of efficient discovery of nonobvious valuable patterns from a large collection of data.
Database Management System (DBMS). A software system that controls and manages the data to eliminate data redundancy and to ensure data integrity, consistency and availability, among other features.
Decision Trees. A class of data mining and statistical methods that form tree like predictive models.
Embedded Data Mining. An implementation of data mining where the data mining algorithms are embedded into existing data stores and information delivery processes rather than requiring data extraction and new data stores.
Entropy. A measure often used in data mining algorithms that measures the disorder of a set of data.
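The glossary does not give a formula. A minimal sketch, assuming the usual Shannon entropy over class labels measured in bits, might look like this:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

mixed = ["yes", "yes", "no", "no"]    # evenly mixed labels: maximum disorder for two classes
pure = ["yes", "yes", "yes", "yes"]   # a single label: no disorder
print(entropy(mixed), entropy(pure))
```

A set containing a single class has zero entropy, and an even two-class mix has an entropy of one bit.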
Error Rate. A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy.
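The relationship between accuracy and error rate can be shown directly. The function names and the sample predictions below are illustrative assumptions.

```python
def accuracy(predicted, actual):
    """Fraction of records for which the prediction matches the actual value."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def error_rate(predicted, actual):
    """One minus the accuracy."""
    return 1.0 - accuracy(predicted, actual)

# Hypothetical model output compared with known outcomes.
predicted = [1, 0, 1, 1]
actual = [1, 0, 0, 1]
print(accuracy(predicted, actual), error_rate(predicted, actual))
```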
Expert System. A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory.
Exploratory Data Analysis. The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data.
Factor Analysis. A statistical technique which seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Field. The structural component of a database that is common to all records in the database. Fields have values. Also called features, attributes, variables, table columns, dimensions.
Front Office. The part of a company's computer system that is responsible for keeping track of relationships with customers.
Fuzzy Logic. A system of logic based on the fuzzy set theory.
Fuzzy Set. A set of items whose degree of membership in the set may range from 0 to 1.
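A degree of membership between 0 and 1 is typically produced by a membership function. The sketch below assumes a triangular shape, which is one common choice and not something the glossary specifies; the "warm" temperature range is likewise invented for illustration.

```python
def triangular_membership(x, low, peak, high):
    """Degree of membership in [0, 1]; requires low < peak < high."""
    if x <= low or x >= high:
        return 0.0                           # outside the set entirely
    if x <= peak:
        return (x - low) / (peak - low)      # rising edge
    return (high - x) / (high - peak)        # falling edge

# Hypothetical fuzzy set "warm", defined over temperatures in degrees Celsius.
for temperature in (10, 20, 25, 30):
    print(temperature, triangular_membership(temperature, 15, 25, 35))
```

A temperature at the peak belongs fully to the set, while temperatures toward either edge belong to it only partially.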
Fuzzy system. A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations.
Genetic algorithm. A method of solving optimization problems using parallel search, based on Darwin's biological model of natural selection and survival of the fittest.
Genetic operator. An operation on the population member strings in a genetic algorithm which is used to produce new strings.
Gini Metric. A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selecting predictors in the CART decision tree algorithm.
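The glossary gives no formula. A sketch assuming the standard Gini impurity (one minus the sum of squared class proportions) and a size-weighted average over the two branches of a split could look like this:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Disorder reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical labels before and after a split that separates the classes perfectly.
parent = ["a", "a", "b", "b"]
left, right = ["a", "a"], ["b", "b"]
print(gini_impurity(parent), gini_reduction(parent, left, right))
```

A decision tree algorithm would evaluate this reduction for each candidate split and prefer the split with the largest value.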
Hebbian Learning. One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
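The strengthening of links between co-active nodes can be sketched as a single update step. The learning rate, the 0/1 activation encoding, and the function name are assumptions made for this illustration.

```python
def hebbian_update(weights, activations, learning_rate=0.1):
    """Return new link weights after one Hebbian step.

    weights[i][j] is the link weight between nodes i and j; activations[i]
    is 1 if node i is active and 0 otherwise.
    """
    n = len(activations)
    updated = [row[:] for row in weights]  # copy so the input is not modified
    for i in range(n):
        for j in range(n):
            if i != j:
                # The weight grows only when both nodes are active together.
                updated[i][j] += learning_rate * activations[i] * activations[j]
    return updated

weights = [[0.0] * 3 for _ in range(3)]
new_weights = hebbian_update(weights, activations=[1, 1, 0])
print(new_weights)
```

Only the link between the two active nodes changes; links to the inactive node keep their previous weight.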
Hill Climbing Search. A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima.
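The accept-if-better loop described above can be sketched for a one-dimensional problem. The step size, iteration count, fixed random seed, and the example scoring function are all assumptions chosen for illustration.

```python
import random

def hill_climb(score, start, step=0.1, iterations=1000, seed=0):
    """Repeatedly nudge the solution and keep the change only if the score improves."""
    rng = random.Random(seed)
    current = start
    current_score = score(current)
    for _ in range(iterations):
        candidate = current + rng.uniform(-step, step)   # small modification
        candidate_score = score(candidate)
        if candidate_score > current_score:              # accept only improvements
            current, current_score = candidate, candidate_score
    return current, current_score

# Hypothetical objective with a single peak at x = 2.
best_x, best_score = hill_climb(lambda x: -(x - 2.0) ** 2, start=0.0)
print(best_x, best_score)
```

Because worse candidates are never accepted, an objective with several peaks can leave the search stuck on whichever peak is nearest the starting point.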
Hypothesis Testing. The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation.
ID3. One of the earliest decision tree algorithms.
Independence (statistical). The property of two events displaying no causality or relationship of any kind. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of both events occurring.
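The product rule can be checked exactly on a small sample space. The two-dice setting and the event names below are assumptions chosen to keep the arithmetic easy to verify.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Exact probability of `event` over the two-dice sample space."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def are_independent(a, b):
    """True when P(a and b) equals P(a) * P(b)."""
    return probability(lambda o: a(o) and b(o)) == probability(a) * probability(b)

def first_is_six(o):
    return o[0] == 6

def second_is_even(o):
    return o[1] % 2 == 0

def total_is_twelve(o):
    return sum(o) == 12

print(are_independent(first_is_six, second_is_even))
print(are_independent(first_is_six, total_is_twelve))
```

The first pair of events concerns different dice and satisfies the product rule; the second pair does not, because a total of twelve requires the first die to show a six.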
Intelligent Agent. A software application which assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Kohonen Networks. A type of neural network in which the nodes learn as local neighborhoods, so the locality of the nodes is important in the training process. They are often used for clustering.
Knowledge Discovery. A term often used interchangeably with data mining.
Lift. A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used.
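Lift is often expressed as a ratio of response rates. The sketch below assumes that reading, which is one common convention rather than the only one; the campaign figures are invented.

```python
def lift(targeted_responses, targeted_size, baseline_responses, baseline_size):
    """Ratio of the response rate with a model to the response rate without one."""
    targeted_rate = targeted_responses / targeted_size
    baseline_rate = baseline_responses / baseline_size
    return targeted_rate / baseline_rate

# Hypothetical campaign: 60 responses from 1,000 model-selected customers,
# compared with 200 responses from a 10,000-customer untargeted mailing.
campaign_lift = lift(60, 1000, 200, 10000)
print(campaign_lift)
```

A value above 1 indicates that the model-selected group responded at a higher rate than the untargeted group.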
Machine Learning. A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence.
Memory-Based Reasoning (MBR). A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle. The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model. A description that adequately explains and predicts relevant data but is generally much smaller than the data itself.
Nearest Neighbor. A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted.
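A minimal sketch of nearest neighbor prediction follows, assuming Euclidean distance and a majority vote among the k closest records. Both choices, along with the sample data, are illustrative assumptions.

```python
import math
from collections import Counter

def nearest_neighbor_predict(training, query, k=3):
    """Predict a label for `query` from the k closest training records.

    `training` is a list of (features, label) pairs.
    """
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records that are already classified.
training = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high"),
]
print(nearest_neighbor_predict(training, (2, 2)))
print(nearest_neighbor_predict(training, (7, 9)))
```

The same approach underlies memory-based reasoning: new records are classified by comparison with similar records whose class is already known.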
Neural Network. A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights.
Nominal Categorical Predictor. A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Occam’s Razor. A rule of thumb used by many scientists that advocates favoring the simplest theory that adequately explains (or predicts) an event. This is more formally captured for machine learning and data mining as the minimum description length principle.
On-Line Analytical Processing (OLAP). Computer-based techniques used to analyze trends and perform business analysis using multidimensional views of business data.
Ordinal Categorical Predictor. A categorical predictor (i.e. one with a finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example, the values high, middle and low for the income predictor.
Outlier Analysis. A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers.
Overfitting. The effect in data analysis, data mining and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. At the limit, overfitting is synonymous with rote memorization where no generalized model of future situations is built.
Predictor. The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Prediction. 1. The column or field in a database that currently has an unknown value that will be assigned when a predictive model is run over other predictor values in the record. Also called dependent variable, target, classification. 2. The process of applying a predictive model to a record. Generally, prediction implies the generation of unknown values within a time series, though in this book prediction is used to mean any process for assigning values to previously unassigned fields, including classification and regression.
Predictive Model. A model created or used to perform prediction, in contrast to models created solely for pattern detection, exploration or general organization of the data.
Principal Components Analysis. A data analysis technique that seeks to weight the importance of a variety of predictors so that they optimally discriminate between various possible predicted outcomes.
Prior Probability. The probability of an event occurring without dependence on (conditional on) some other event. In contrast to conditional probability.
Radial Basis Function Networks. Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records.
Record. The fundamental data structure used for performing data analysis. Also called a table row or example. A typical record would be the structure that contains all relevant information pertinent to one particular customer or account.
Regression. A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data.
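For a single predictor, the equation that minimizes the sum of squared errors can be computed in closed form. The sketch below assumes that error measure (ordinary least squares) and a straight-line model; the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the fitted line will not pass through every point, but it still minimizes the total squared difference between predicted and actual values.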
Reinforcement learning. A training model where an intelligence engine (e.g. neural network) is presented with a sequence of input data followed by a reinforcement signal.
Relational Database (RDB). A database built to conform to the relational data model; includes the catalog and all the data described therein.
Response. A binary prediction field that indicates response or non response to a variety of marketing interventions. The term is generally used when referring to models that predict response or to the response field itself.
Sampling. The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database.
Segmentation. The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field.
Sensitivity Analysis. The process which determines the sensitivity of a predictive model to small fluctuations in predictor value. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model.
Simulated Annealing. An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling.
Structured Query Language (SQL). A standard language for the access of data in a relational database.
Supervised learning. A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well defined prediction field. This is in contrast to unsupervised learning where there is no particular goal aside from pattern detection.
Support. The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support the better the chance of the rule capturing a statistically significant pattern.
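Support and coverage can both be computed by counting transactions. The sketch below assumes that support counts the records in which the whole rule holds and that coverage counts the records to which the rule can be applied (those matching the "if" part). These are common readings but not the only ones, and the basket data is invented.

```python
def rule_coverage(transactions, antecedent):
    """Fraction of transactions to which the rule can be applied."""
    return sum(1 for t in transactions if antecedent <= t) / len(transactions)

def rule_support(transactions, antecedent, consequent):
    """Fraction of transactions containing both the 'if' and the 'then' items."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

# Hypothetical supermarket baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
# Rule under test: if bread then butter.
print(rule_coverage(transactions, {"bread"}))
print(rule_support(transactions, {"bread"}, {"butter"}))
```

The `<=` operator on Python sets tests whether one set is a subset of another, which is how each basket is checked against the rule.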
Targeted Marketing. The marketing of products to select groups of consumers that are more likely than average to be interested in the offer.
Time-series forecasting. The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value.
Unsupervised learning. A data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system.
Visualization. Graphical display of data and models which helps the user in understanding the structure and meaning of the information contained in them.