Certified Data Mining and Warehousing Professional

Single and Multidimensional association rules
 


Boolean Association Rules
The Apriori Algorithm
 Level-wise search
Find L1, then L2, then L3,…, Lk

The Apriori property:
 If A is a frequent itemset, all its subsets are frequent itemsets
 If A is not a frequent itemset, all its supersets are NOT frequent

Why?
 Requiring more items can only match fewer transactions, so adding an item can never increase an itemset's support

 

Apriori algorithm

Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.

Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or the pages visited during website browsing sessions). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).

As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

The purpose of the Apriori algorithm is to find associations between different sets of data. This task is sometimes referred to as "Market Basket Analysis". Each set of data has a number of items and is called a transaction. The output of Apriori is a set of rules that tell us how often items are contained together in the transactions. Here is an example:

Each line below is a transaction (a set of items):

alpha beta gamma
alpha beta theta
alpha beta epsilon
alpha beta theta
  1. 100% of sets with alpha also contain beta
  2. 25% of sets with alpha, beta also have gamma
  3. 50% of sets with alpha, beta also have theta
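
The percentages above are rule confidences: the fraction of transactions containing the left-hand side of a rule that also contain the right-hand side. As a minimal illustration (using the toy item names above), the following Python sketch computes them directly.

    # Toy transaction database from the example above
    transactions = [
        {"alpha", "beta", "gamma"},
        {"alpha", "beta", "theta"},
        {"alpha", "beta", "epsilon"},
        {"alpha", "beta", "theta"},
    ]

    def confidence(antecedent, consequent, db):
        # Fraction of transactions containing the antecedent that also contain the consequent
        matching = [t for t in db if antecedent <= t]
        return sum(1 for t in matching if consequent <= t) / len(matching) if matching else 0.0

    print(confidence({"alpha"}, {"beta"}, transactions))           # 1.0  (rule 1: 100%)
    print(confidence({"alpha", "beta"}, {"gamma"}, transactions))  # 0.25 (rule 2: 25%)
    print(confidence({"alpha", "beta"}, {"theta"}, transactions))  # 0.5  (rule 3: 50%)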

Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently. It generates candidate itemsets of length k from itemsets of length k-1, then prunes the candidates which have an infrequent subpattern. According to the downward closure lemma, the candidate set contains all frequent k-length itemsets. After that, it scans the transaction database to determine which of the candidates are actually frequent.

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation produces large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan), and the bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal frequent itemset S only after examining all 2^|S| - 1 of its proper subsets.

Algorithm Pseudocode

The pseudocode for the algorithm is given below for a transaction database T and a support threshold ε. The usual set-theoretic notation is employed, though note that T is a multiset. C_k is the candidate set for level k; candidates are generated from the large itemsets L_{k-1} of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data structure that represents candidate itemset c, which is initially assumed to be zero. Many details are omitted below; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.

    Apriori(T, ε)
        L_1 ← { large 1-itemsets }
        k ← 2
        while L_{k-1} ≠ ∅
            C_k ← { a ∪ {b} | a ∈ L_{k-1}, b ∈ ∪L_{k-1}, b ∉ a }
            for transactions t ∈ T
                C_t ← { c ∈ C_k | c ⊆ t }
                for candidates c ∈ C_t
                    count[c] ← count[c] + 1
            L_k ← { c ∈ C_k | count[c] ≥ ε }
            k ← k + 1
        return ∪_k L_k
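
The pseudocode translates fairly directly into a small program. The following Python sketch is one possible reading of it, assuming transactions are plain Python sets and the threshold ε is an absolute count; it uses simple dictionaries rather than the hash tree of a real implementation.

    from itertools import combinations

    def apriori(transactions, min_support):
        # Return every frequent itemset (as a frozenset) mapped to its support count.
        # L1: count single items and keep those meeting the threshold.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s for s, c in counts.items() if c >= min_support}
        frequent = {s: counts[s] for s in current}

        k = 2
        while current:
            # Candidate generation: extend each frequent (k-1)-itemset by one new item.
            items = set().union(*current)
            candidates = {a | frozenset([b]) for a in current for b in items if b not in a}
            # Prune candidates that have an infrequent (k-1)-subset (downward closure).
            candidates = {c for c in candidates
                          if all(frozenset(s) in current for s in combinations(c, k - 1))}
            # One scan over the transactions to count the surviving candidates.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {c for c, n in counts.items() if n >= min_support}
            frequent.update({c: counts[c] for c in current})
            k += 1
        return frequent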

Example

A large supermarket tracks sales data by stock-keeping unit (SKU) for each item, and thus is able to know what items are typically purchased together. Apriori is a moderately efficient way to build a list of frequently purchased item pairs from this data. Let the database of transactions consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4}, {3,4}, and {2,4}. Each number corresponds to a product such as "butter" or "bread". The first step of Apriori is to count up the frequencies, called the support, of each member item separately:

The following tables show how the Apriori algorithm works on this data.

Item Support
1 3/7
2 6/7
3 4/7
4 5/7

We can define a minimum support level to qualify an itemset as "frequent"; the right level depends on the context. For this example, let min support = 3/7, so all four items are frequent. The next step is to generate a list of all pairs of frequent items. Had any of the above items not been frequent, it would not have been included as a member of any candidate pair; this is how Apriori prunes the tree of possible sets. In the next step we again keep only those itemsets (now pairs) that are frequent:

Item Support
{1,2} 3/7
{1,3} 1/7
{1,4} 2/7
{2,3} 3/7
{2,4} 4/7
{3,4} 3/7

The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3/7. The pairs {1,3} and {1,4} do not. When we move onto generating the list of all triplets, we will not consider any triplets that contain {1,3} or {1,4}:

Item Support
{2,3,4} 2/7

In the example, there are no frequent triplets: {2,3,4} has a support of 2/7, which is below our minimum, and we do not consider any other triplet because each of them contains either {1,3} or {1,4}, which were discarded when the frequent pairs were computed in the second table.
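
For reference, running the apriori sketch given after the pseudocode on this database with an absolute support threshold of 3 (i.e. 3/7) reproduces the tables above.

    db = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
    for itemset, count in sorted(apriori(db, min_support=3).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), f"{count}/7")
    # Prints all four single items, then the pairs {1,2}, {2,3}, {2,4}, {3,4};
    # no triplet reaches the 3/7 threshold, so none is frequent.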

 


Multidimensional Association Rules

Rules involving more than one dimension or predicate
buys(X, “IBM Laptop Computer”) ->
                 buys(X, “HP Inkjet Printer”)
(Single dimensional)
age(X, “20..25”) and occupation(X, “student”) ->
                buys(X, “HP Inkjet Printer”)
        (Multidimensional - Inter-dimension Association Rule)
age(X, “20..25”) and buys(X, “IBM Laptop Computer”) ->
                buys(X, “HP Inkjet Printer”)
        (Multidimensional - Hybrid-dimension Association Rule)

  • Attributes can be categorical or quantitative
  • Quantitative attributes are numeric and have an implicit ordering or hierarchy (age, income, ...)
  • Numeric attributes must be discretized
  • Three different approaches to mining multidimensional association rules:
    • Using static discretization of quantitative attributes
    • Using dynamic discretization of quantitative attributes
    • Using distance-based discretization with clustering

Mining using Static Discretization

  • Discretization is static and occurs prior to mining
  • Discretized attributes are treated as categorical
  • Use the Apriori algorithm to find all frequent k-predicate sets (a minimal sketch follows this list)
  • Every subset of a frequent predicate set must be frequent
  • In a data cube, if the 3-D cuboid (age, income, buys) is frequent, then the 2-D cuboids (age, income), (age, buys), and (income, buys) must also be frequent
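
As a minimal sketch of this approach (the interval boundaries, field names, and records below are illustrative assumptions), each record is converted into a set of (predicate, interval or value) items, which can then be mined with the apriori sketch given earlier.

    def age_bucket(age):
        # Static, predefined intervals (assumed boundaries)
        if age < 20:
            return "<20"
        if age <= 25:
            return "20..25"
        if age <= 30:
            return "26..30"
        return "31+"

    def income_bucket(income):
        if income < 30_000:
            return "<30K"
        if income < 40_000:
            return "30K..40K"
        if income < 50_000:
            return "40K..50K"
        return "50K+"

    customers = [
        {"age": 22, "income": 32_000, "buys": "HP Inkjet Printer"},
        {"age": 24, "income": 38_000, "buys": "HP Inkjet Printer"},
        {"age": 35, "income": 55_000, "buys": "IBM Laptop Computer"},
    ]

    # Each customer becomes a "transaction" of discretized predicates.
    predicate_sets = [
        {("age", age_bucket(c["age"])),
         ("income", income_bucket(c["income"])),
         ("buys", c["buys"])}
        for c in customers
    ]

    frequent_predicate_sets = apriori(predicate_sets, min_support=2)
    # e.g. {("age","20..25"), ("income","30K..40K"), ("buys","HP Inkjet Printer")} has support 2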

Mining using Dynamic Discretization

  • Known as Mining Quantitative Association Rules
  • Numeric attributes are dynamically discretized
  • Consider rules of the type

        A_quan1 ∧ A_quan2 -> A_cat
    (2-D Quantitative Association Rules)
    age(X, ”20…25”) ∧ income(X, ”30K…40K”) -> buys(X, ”Laptop Computer”)

  • ARCS (Association Rule Clustering System): an approach for mining quantitative association rules (a rough sketch of the underlying binning idea follows this list)
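
The sketch below shows only the grid/binning idea behind 2-D quantitative rules, not the ARCS system itself; the bin widths, field names, and records are assumptions.

    from collections import Counter

    records = [
        {"age": 21, "income": 31_000, "buys": "Laptop Computer"},
        {"age": 23, "income": 36_000, "buys": "Laptop Computer"},
        {"age": 24, "income": 39_000, "buys": "Laptop Computer"},
        {"age": 43, "income": 80_000, "buys": "Desktop Computer"},
    ]

    AGE_BIN, INCOME_BIN = 5, 10_000   # equal-width bins for the two quantitative attributes

    # Count occurrences of each (age bin, income bin, consequent) grid cell.
    cells = Counter((r["age"] // AGE_BIN, r["income"] // INCOME_BIN, r["buys"]) for r in records)

    min_support = 2
    for (a, i, item), n in cells.items():
        if n >= min_support:
            print(f'age(X, "{a * AGE_BIN}..{a * AGE_BIN + AGE_BIN - 1}") AND '
                  f'income(X, "{i * INCOME_BIN // 1000}K..{(i + 1) * INCOME_BIN // 1000}K") '
                  f'-> buys(X, "{item}")  [support {n}/{len(records)}]')
    # ARCS would additionally cluster ("merge") adjacent dense cells into larger rectangles.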

Distance-based Association Rules
Two-step mining process:

  • Perform clustering to find the intervals of the attributes involved (a minimal sketch of this step appears below)
  • Obtain association rules by searching for groups of clusters that occur together

The resultant rules must satisfy

  • Clusters in the rule antecedent are strongly associated with clusters in the consequent
  • Clusters in the antecedent occur together
  • Clusters in the consequent occur together
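
A minimal sketch of step 1 (the clustering step) is given below; the simple gap-based grouping and the data are assumptions, not a specific published method. The resulting intervals would then feed step 2, the search for clusters that occur together.

    def cluster_1d(values, max_gap):
        # Group sorted values into clusters: a gap larger than max_gap starts a new cluster.
        if not values:
            return []
        values = sorted(values)
        clusters = [[values[0]]]
        for v in values[1:]:
            if v - clusters[-1][-1] <= max_gap:
                clusters[-1].append(v)
            else:
                clusters.append([v])
        # Report each cluster as an interval of the attribute.
        return [(c[0], c[-1]) for c in clusters]

    incomes = [31_000, 33_000, 35_000, 72_000, 75_000]
    print(cluster_1d(incomes, max_gap=5_000))
    # [(31000, 35000), (72000, 75000)]  -> candidate intervals for rule antecedents/consequents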

 
