Cluster Discovery

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with low distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify preprocessing and parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification primarily their discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

It includes the following topics -

The result of a cluster analysis shown as the coloring of the squares into three clusters.

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression.

First the set is partitioned into groups based on data similarity (e.g., using clustering), and then labels are assigned to the relatively small number of groups.

It is also called unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples.

Advantages of such a clustering-based process:

  1. adaptable to changes
  2. helps single out useful features that distinguish different groups.

By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes.

Applications of Clustering:

  1. market research :can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.
  2. pattern recognition:
  3. data analysis: It can also be used to help classify documents on the Web for information discovery.
  4. image processing
  5. Biology: can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
  6. Geography:  help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location
  7. Automobile insurance: the identification of groups of automobile insurance policy holders with a high average claim cost
  8. Outlier detection: the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are “far away” from any cluster) may be more interesting than common cases.

The following are typical requirements of clustering in data mining:

  1. Scalability
  2. Ability to deal with different types of attributes
  3. Discovery of clusters with arbitrary shape
  4. Minimal requirements for domain knowledge to determine input parameters
  5. Ability to deal with noisy data
  6. Incremental clustering and insensitivity to the order of input records
  7. High dimensionality