BY 4.0 license Open Access Published by De Gruyter Open Access November 11, 2022

An algebraic semigroup method for discovering maximal frequent itemsets

  • Jiang Liu, Jing Li, Feng Ni, Xiang Xia, Shunlong Li and Wenhui Dong
From the journal Open Mathematics

Abstract

Discovering maximal frequent itemsets is an important issue and a key technique in many data mining problems such as association rule mining. In the literature, generating maximal frequent itemsets proves either to be NP-hard or to have $O(l^3 4^l (m+n))$ complexity in the worst case from the perspective of generating maximal complete bipartite graphs of a bipartite graph, where $m$, $n$ are the item number and the transaction number, respectively, and $l$ denotes the maximum of $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$, with the maximum taken over all maximal frequent itemsets $C$. In this article, we put forward a method for discovering maximal frequent itemsets from the perspective of semigroup algebra, whose worst-case complexity is $O(3mn2^{\beta}+4^{\beta}n)$, lower than the known worst-case complexity, where $\beta$ is the number of items whose support is more than the minimum support threshold. Experiments also show that an algorithm based on the algebraic method performs better than three other well-known algorithms. Meanwhile, we explore some algebraic properties with respect to items and transactions, prove that the maximal frequent itemsets are exactly the simplified generators of frequent itemsets, give a necessary and sufficient condition for a maximal $(i+1)$-frequent itemset being a subset of a closed $i$-frequent itemset, and provide a recurrence formula of maximal frequent itemsets.

MSC 2010: 20-08; 68W99

1 Introduction

In 1993, Agrawal et al. [1] introduced the problem of mining a large collection of basket data-type transactions for association rules between sets of items with a minimal confidence threshold and presented an efficient algorithm for this purpose. Association rule mining has since been a research focus and has become a key technique in data mining, with broad and successful applications in areas such as market analysis, financial investment, health, environmental protection, and product manufacturing, yielding significant social and economic benefits.

A key component of the association rule mining problem (e.g., [2,3,4]) and of other data mining problems such as episodes [5] and minimal keys [6] is to discover frequent itemsets and maximal frequent itemsets. Maximal frequent itemsets (MFIs) can be used to improve the performance of discovering frequent itemsets. Furthermore, it suffices to know only the maximal frequent itemsets in many data mining applications, such as minimal key discovery and theory extraction [6].

Most of the research work on discovering maximal frequent itemsets has focused on developing deterministic algorithms (e.g., Bayardo [7]; Eppstein [8]; Lin and Kedem [9]; Boros et al. [10]; Dhabu and Deshpande [11]; Kabir et al. [12]; Karim et al. [13]; Halim et al. [14]), on complexity analysis (Eppstein [8]; Boros et al. [10]), on developing approximation algorithms (e.g., Fatemi et al. [15]; Zhang et al. [16]), or on mining maximal frequent itemsets with added constraints such as maximal diverse frequent itemsets (e.g., Wu et al. [17]). Boros et al. [10] claimed that maximal $t$-frequent sets cannot be generated efficiently, unless $P = NP$. Eppstein [8] turned the problem of discovering maximal frequent itemsets into finding all maximal complete bipartite graphs of a bipartite graph on $m+n$ vertices and gave an algorithm with the complexity $O(l^3 4^l (m+n))$ in the worst case, where $m$, $n$ are the item number and the transaction number, respectively, and $l$ denotes the maximum of $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$, with the maximum taken over all maximal frequent itemsets $C$.

In this article, we put forward a semigroup algebraic method for mining maximal frequent itemsets, whose worst-case complexity is $O(3mn2^{\beta}+4^{\beta}n)$ (in practical applications), lower than the known worst-case complexity $O(l^3 4^l (m+n))$ given by Eppstein [8], where $\beta$ is the number of items whose support is more than the minimum support threshold.

First, we use algebraic language to describe the natural algebraic structure of items and transactions and explore some algebraic properties. For instance, there is a semigroup homomorphism between itemsets and transactions.

Under the algebraic framework, we give the explicit forms of simplified generators of frequent itemsets and prove that the simplified generators are maximal frequent itemsets and vice versa. Then we define basic itemsets and sim-basic itemsets and use them to construct the recurrence formula of maximal frequent itemsets. Finally, the recurrence formula is used to provide methods for discovering maximal frequent itemsets whose worst-case complexity proves to be much lower than the known worst-case complexity. Experimental comparisons also indicate a better performance of an algorithm based on the method.

2 Preliminaries

In this section, we briefly review some basic notions about algebraic semigroups and frequent itemset mining. Interested readers can refer to the relevant literature (e.g., Agrawal et al. [1]; Clifford and Preston [18]) for more details, or for more information on approaches to frequent itemset mining (e.g., Luna [19]).

Let $\mathcal{I}$ be a set of items, and $T$ be a set of transactions, where each transaction has a unique identifier and contains a set of items (also called an itemset). The support of an itemset $X$, denoted by $\mathrm{supp}(X)$, is the number of transactions in which it occurs as a subset. An itemset is frequent if its support is not less than a pre-specified minimum support threshold min sup. A frequent itemset $X$ is a maximal frequent itemset if there is no other itemset $X'$ such that $X'$ is frequent and $X \subset X'$. $X$ is a closed itemset if there is no other itemset $X'$ such that $X \subset X'$ and $\mathrm{supp}(X') = \mathrm{supp}(X)$.
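The definitions above can be made concrete with a brute-force sketch in Python. The four-transaction database `DB` and all function names below are hypothetical, chosen only for illustration; the enumeration is exponential in the number of items and is not the method developed in this article.

```python
from itertools import combinations

# Hypothetical toy database (an assumption for illustration): id -> items.
DB = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a", "c"}, 4: {"b"}}
ITEMS = sorted(set().union(*DB.values()))


def supp(itemset):
    """Number of transactions that contain `itemset` as a subset."""
    return sum(1 for t in DB.values() if set(itemset) <= t)


def frequent_itemsets(min_sup):
    """All non-empty itemsets whose support is at least `min_sup`."""
    return [frozenset(c)
            for r in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, r)
            if supp(c) >= min_sup]


def is_closed(itemset):
    """True if no proper superset has the same support.

    Checking one-item extensions is enough because support can only
    shrink when items are added.
    """
    x = set(itemset)
    return all(supp(x | {i}) < supp(x) for i in ITEMS if i not in x)


def maximal_frequent(min_sup):
    """Frequent itemsets that have no frequent proper superset."""
    fi = frequent_itemsets(min_sup)
    return [x for x in fi if not any(x < y for y in fi)]
```

With a threshold of 2 on this toy database, `{c}` is frequent but neither closed nor maximal, because `{a, c}` has the same support.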

A binary operation on a set $S$ is a mapping of $S \times S$ into $S$, where $S \times S$ is the set of all ordered pairs of elements of $S$. If the mapping is denoted by a dot ($\cdot$), the image in $S$ of the element $(a, b)$ of $S \times S$ will be denoted by $a \cdot b$. Frequently, we shall omit the dot, writing $ab$ for $a \cdot b$. Other symbols which may be used to denote binary operations are $+$, $\circ$, $*$.

A semigroup is a set $S$ together with a binary operation ($\cdot$), such that the operation is associative, i.e., $\forall a, b, c \in S$, $a \cdot (b \cdot c) = (a \cdot b) \cdot c$. Denote the semigroup by $S(\cdot)$, or by $S$ for simplicity when there is no danger of ambiguity, and say that $S$ is a semigroup with respect to $\cdot$. If $a \cdot b = b \cdot a$ for all $a, b \in S$, then $S$ is called a commutative semigroup. If $e \in S$ and $\forall a \in S$, $a \cdot e = a = e \cdot a$, then $e$ is an identity element.

A subset $T$ of $S$ is called a sub-semigroup of $S$ if $a \in T$ and $b \in T$ imply $a \cdot b \in T$.

Suppose $T$ is a non-empty subset of $S$. The sub-semigroup $\langle T \rangle$ is the set of all elements of $S$ expressible as finite products of elements of $T$. $T$ is called a set of generators of $\langle T \rangle$.

Let $S$ and $S'$ be groupoids. A mapping $\phi$ of $S$ into $S'$ is called a homomorphism if $(ab)\phi = (a\phi)(b\phi)$ for all $a, b$ in $S$.

3 Algebraic properties

In this section, we investigate the algebraic semigroup structure and some basic algebraic properties for items and transactions.

In what follows, we always assume that the element number of the item set $\mathcal{I}$ is $m$, and the element number of the transaction set $T$ is $n$.

We use $2^{\mathcal{I}}$ to denote the power set of $\mathcal{I}$, i.e., the set of all subsets of $\mathcal{I}$. And we denote the power set of $T$ by $2^{T}$.

For any two elements $X_1, X_2 \in 2^{\mathcal{I}}$, since $X_1 \cup X_2 \in 2^{\mathcal{I}}$, we can take set union as a binary operation $\cup$ on $2^{\mathcal{I}}$:

$(X_1, X_2) \mapsto X_1 \cup X_2$.

$\cup$ has the following properties, for any $X_1, X_2, X_3 \in 2^{\mathcal{I}}$:

  1. $X_1 \cup (X_2 \cup X_3) = (X_1 \cup X_2) \cup X_3$ (associativity);

  2. $X_1 \cup \emptyset = X_1$ (an identity element);

  3. $X_1 \cup X_2 = X_2 \cup X_1$ (commutativity).

Hence, $2^{\mathcal{I}}$ is a commutative monoid with respect to the operation $\cup$, which has an identity element $\emptyset$.

Similarly, if we take set intersection as the binary operation $\cap$, i.e., $(X_1, X_2) \mapsto X_1 \cap X_2$, then $2^{\mathcal{I}}(\cap)$ is also a commutative monoid, which has an identity element $\mathcal{I}$.

Proposition 1

(Basic properties of itemsets) Under $\cup$ and $\cap$, $2^{\mathcal{I}}$ has the following properties, $\forall X, X_1, X_2, \ldots, X_k \in 2^{\mathcal{I}}$,

  1. $X_1 \cap (X_2 \cup X) = (X_1 \cap X_2) \cup (X_1 \cap X)$.

  2. $X_1 \cup (X_2 \cap X) = (X_1 \cup X_2) \cap (X_1 \cup X)$.

  3. $X \cup X = X$, $X \cap X = X$.

  4. $(X_1 \cap X_2) \cup X_i = X_i$, $i = 1, 2$.

  5. If $X_1 \cap \cdots \cap X_k \cap X = X$, then $X_i \cap X = X$, $i = 1, 2, \ldots, k$.

From properties 1 and 4, we directly have

$(X_1 \cup X_2) \cap X_i = X_i$, $i = 1, 2$.

Similarly, $2^{T}(\cup)$ and $2^{T}(\cap)$ are also commutative monoids, which take $\emptyset$ and $T$ as their identity elements, respectively.

The following mapping describes the support of an itemset X .

Definition 1

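We define a mapping $\Psi : 2^{\mathcal{I}} \to 2^{T}$ as follows: For $X \in 2^{\mathcal{I}}$ and $X \neq \emptyset$, first find all the transactions $t_1, t_2, \ldots, t_s \in T$ such that $X \subseteq t_i$. Then define $\Psi(X) = \{t_1, t_2, \ldots, t_s\}$. For $\emptyset$, define $\Psi(\emptyset) = T$.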

In fact, the mapping described in Definition 1 is a homomorphism.

Theorem 1

(Homomorphism between itemsets and transactions) $\Psi$ is a homomorphism, i.e., if $X_1, X_2 \in 2^{\mathcal{I}}$, then $\Psi(X_1 \cup X_2) = \Psi(X_1) \cap \Psi(X_2)$.

Proof

If $\Psi(X_1) \cap \Psi(X_2) = \emptyset$, then obviously $\Psi(X_1 \cup X_2) \supseteq \Psi(X_1) \cap \Psi(X_2)$. Otherwise, let $t \in \Psi(X_1) \cap \Psi(X_2)$. Then $X_1, X_2 \subseteq t$. Hence, $X_1 \cup X_2 \subseteq t$, so $t \in \Psi(X_1 \cup X_2)$, implying $\Psi(X_1 \cup X_2) \supseteq \Psi(X_1) \cap \Psi(X_2)$.

Conversely, if $\Psi(X_1 \cup X_2) = \emptyset$, then obviously $\Psi(X_1 \cup X_2) \subseteq \Psi(X_1) \cap \Psi(X_2)$. Otherwise, let $t \in \Psi(X_1 \cup X_2)$. Then $X_1 \cup X_2 \subseteq t$. Hence, $X_1, X_2 \subseteq t$, so $t \in \Psi(X_1) \cap \Psi(X_2)$, implying $\Psi(X_1 \cup X_2) \subseteq \Psi(X_1) \cap \Psi(X_2)$.□
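The statement of Theorem 1 can be checked exhaustively on a small database. The sketch below is an illustration only: it uses the six transactions that the article lists in Table 1, and the names `DB`, `psi`, and `is_homomorphism` are hypothetical helpers, not part of the article.

```python
from itertools import combinations

# Transactions of Table 1 in this article: id -> items.
DB = {1: set("BCDE"), 2: set("ABDE"), 3: set("BCDE"),
      4: set("ABCDE"), 5: set("DE"), 6: set("C")}
ITEMS = sorted(set().union(*DB.values()))


def psi(itemset):
    """Psi(X): ids of transactions containing every item of X.

    For the empty itemset the subset test is always true, so
    Psi(empty) is the whole transaction set, as in Definition 1.
    """
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= items)


def is_homomorphism():
    """Check Psi(X1 | X2) == Psi(X1) & Psi(X2) for all pairs of itemsets."""
    subsets = [frozenset(c)
               for r in range(len(ITEMS) + 1)
               for c in combinations(ITEMS, r)]
    return all(psi(x1 | x2) == psi(x1) & psi(x2)
               for x1 in subsets for x2 in subsets)
```

The check also covers the identity elements: the empty itemset (identity for union) is mapped to the full transaction set (identity for intersection).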

4 Simplified generators of frequent itemsets

In this section, we first present explicit forms of simplified generators for frequent itemsets. Then the maximal frequent itemsets prove to be the simplified generators.

Definition 2

Let min sup and $FI$ be the minimal support threshold and the set of frequent itemsets, respectively. Suppose $U \subseteq FI$. If for each element $X$ in $FI$ there exists an element $\{I_1, \ldots, I_h\}$ in $U$ such that $X \in \langle \{I_1\}, \ldots, \{I_h\} \rangle$, then we say that $U$ can generate all the frequent itemsets.

Lemma 1

  1. If $|Y| \geq k$, and $Y \in \mathcal{Y}$, then there exists $Y_1 \in \mathcal{Y}$ such that $|Y_1| = k$, and $Y_1 \cap Y = Y_1$.

  2. If $Y_1 \cap Y_2 = Y_1$, then $|Y_1| \leq |Y_2|$.

Proof

(1) We can obtain a subset of $Y$ with $k$ elements by removing $|Y| - k$ elements from $Y$. This completes the proof.

(2) It is obviously true by noting that $Y_1$ is a subset of $Y_2$.□

Theorem 2

(Simplified generators of frequent itemsets) Let min sup be the minimal support threshold. First, find the elements $Y_1, \ldots, Y_s$ in $\mathcal{Y}$ such that $|Y_i| = \text{min sup}$, and then find the elements $I^i_1, \ldots, I^i_{h_i}$ in the item set $\mathcal{I}$ such that $\Psi(\{I^i_j\}) \cap Y_i = Y_i$ for $1 \leq j \leq h_i$. Let $FI$ be the set of frequent itemsets.

Furthermore, in the set $\Gamma = \{1, 2, \ldots, s\}$, for any two integers $i \neq j$, if $\{I^i_1, \ldots, I^i_{h_i}\}$ is a subset of $\{I^j_1, \ldots, I^j_{h_j}\}$, then remove $i$ from $\Gamma$. Finally, we obtain the reduced set $\Gamma$. Then

  1. $FI = \bigcup_{i=1}^{s} \langle \{I^i_1\}, \ldots, \{I^i_{h_i}\} \rangle$. In other words, $\{\{I^i_1, \ldots, I^i_{h_i}\} \mid i = 1, \ldots, s\}$ can generate all the frequent itemsets. $\{I^i_1, \ldots, I^i_{h_i}\}$ is called the G-frequent itemset for $Y_i$.

  2. $FI = \bigcup_{i \in \Gamma} \langle \{I^i_1\}, \ldots, \{I^i_{h_i}\} \rangle$. We call $\{\{I^i_1, \ldots, I^i_{h_i}\} \mid i \in \Gamma\}$ the set of simplified generators for min sup.

Proof

(1) Suppose X 0 = { I a 1 , , I a p } FI , I a i { I 1 k , , I h k k } . Since Y k Ψ ( { I j k } ) = Y k , Ψ ( X 0 ) Y k = Y k . Furthermore, Ψ ( X 0 ) Y k = min sup , implying that X 0 is a frequent itemset.

Conversely, suppose $X_0$ is a frequent itemset, where $X_0 = \{I_{a_1}, \ldots, I_{a_p}\}$, $I_{a_i} \in \mathcal{I}$. Then $|\Psi(\{I_{a_1}\}) \cap \Psi(\{I_{a_2}\}) \cap \cdots \cap \Psi(\{I_{a_p}\})| \geq \text{min sup}$. By Lemma 1, there exists $Y_i \in \mathcal{Y}$ such that $|Y_i| = \text{min sup}$, and $\Psi(\{I_{a_1}\}) \cap \Psi(\{I_{a_2}\}) \cap \cdots \cap \Psi(\{I_{a_p}\}) \cap Y_i = Y_i$. Therefore, by Proposition 1, $\Psi(\{I_{a_j}\}) \cap Y_i = Y_i$, which completes the proof.

(2) The statement is obviously true since for two subsets $A$, $B$ of $\mathcal{X}$, if $A \subseteq B$, then $\langle A \rangle \subseteq \langle B \rangle$.□

Lemma 2

Suppose $U$ can generate all the frequent itemsets, and $A \nsubseteq B$ for any two distinct elements $A, B \in U$. Then $U$ is the set of maximal frequent itemsets.

Proof

This can be directly verified by the definition of maximal frequent itemset.□

The following statement shows that the maximal frequent itemsets are simplified generators.

Corollary 1

(Explicit forms of maximal frequent itemsets) Let $\{\{I^i_1, \ldots, I^i_{h_i}\} \mid i \in \Gamma\}$ be the set of simplified generators. Then it is the set of maximal frequent itemsets.

Proof

By Theorem 2, $\{\{I^i_1, \ldots, I^i_{h_i}\} \mid i \in \Gamma\}$ can generate all the frequent itemsets, and for any $j \neq i$, $\{I^i_1, \ldots, I^i_{h_i}\}$ is not a subset of $\{I^j_1, \ldots, I^j_{h_j}\}$. Hence, $\{\{I^i_1, \ldots, I^i_{h_i}\} \mid i \in \Gamma\}$ is the set of maximal frequent itemsets by Lemma 2.□
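The construction in Theorem 2 and Corollary 1 can be sketched directly, although only as a brute-force illustration: it enumerates every transaction set of size min sup, which is far more work than the algorithm developed later. The database is the one listed in Table 1; the function names are hypothetical, and the sketch assumes that duplicate G-frequent itemsets are merged before proper subsets are removed.

```python
from itertools import combinations

# Transactions of Table 1 in this article: id -> items.
DB = {1: set("BCDE"), 2: set("ABDE"), 3: set("BCDE"),
      4: set("ABCDE"), 5: set("DE"), 6: set("C")}
ITEMS = sorted(set().union(*DB.values()))


def psi_item(item):
    """Ids of the transactions that contain a single item."""
    return frozenset(tid for tid, items in DB.items() if item in items)


def g_frequent_itemsets(min_sup):
    """For each transaction set Y with |Y| = min_sup, collect the items
    whose transaction set contains Y (the G-frequent itemset for Y)."""
    gens = set()
    for y in combinations(sorted(DB), min_sup):
        g = frozenset(i for i in ITEMS if set(y) <= psi_item(i))
        if g:  # ignore transaction sets shared by no item
            gens.add(g)
    return gens


def simplified_generators(min_sup):
    """Drop every generator that is a proper subset of another one."""
    gens = g_frequent_itemsets(min_sup)
    return {g for g in gens if not any(g < h for h in gens)}
```

For a threshold of 3 the only surviving generator is `{B, C, D, E}`, which agrees with the result of Example 1.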

5 Maximal frequent itemset discovering and complexity analysis

In this section, we first define basic itemsets and sim-basic itemsets and give a necessary and sufficient condition for a maximal $(i+1)$-frequent itemset being a subset of a closed $i$-frequent itemset, then provide a recurrence formula for maximal frequent itemsets. Finally, an algorithm for discovering maximal frequent itemsets is presented, based on the recurrence formula, and its worst-case complexity proves to be smaller than the known worst-case complexity given in [8].

Given a minimal support threshold t , a maximal frequent itemset whose support is t is either a basic itemset or a sim-basic itemset, which will be proved in Theorem 3. Basic itemsets and sim-basic itemsets are defined as follows.

Definition 3

For $0 \leq i \leq n$, $\mathcal{I}^{(i)}$ denotes the subset of the item set $\mathcal{I}$ such that for $I \in \mathcal{I}^{(i)}$, $|\Psi(\{I\})| = i$. Divide $\mathcal{I}^{(i)}$ into disjoint subsets denoted by $X^{(i)}_1, \ldots, X^{(i)}_{i_s}$, such that for any two elements $I_1, I_2 \in \mathcal{I}^{(i)}$, $\Psi(\{I_1\}) = \Psi(\{I_2\})$ if and only if $I_1, I_2$ are in the same subset. For each subset $X^{(i)}_j$ ($1 \leq j \leq i_s$), we use $\tilde{X}^{(i)}_j$ to denote the itemset $\{I \mid \Psi(X^{(i)}_j) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$, and call it a basic itemset (for $i$).

Let $P$ be the sub-semigroup of $(\mathcal{X}, \cup)$, generated by $\{\tilde{X}^{(k)}_j \mid 1 \leq j \leq k_s, i+1 \leq k \leq n\}$. Suppose $X_0 \in P$. If $|\Psi(X_0)| = i$, and for any basic itemset $\tilde{X} \nsubseteq X_0$, $\Psi(X_0) \nsubseteq \Psi(\tilde{X})$, then we call $X_0$ a sim-basic itemset (for $i$).

Remark 1

It is clear that $\Psi(\tilde{X}^{(i)}_j) = \Psi(X^{(i)}_j) = \Psi(\{I_0\})$ by the definition of basic itemset, where $I_0 \in X^{(i)}_j$.
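A minimal sketch of how the basic itemsets of Definition 3 might be computed is given below. It reads the defining condition as "every item whose transaction set contains the transaction set of the class", uses the database listed in Table 1, and introduces hypothetical helper names (`psi_item`, `basic_itemsets`) that do not appear in the article.

```python
from collections import defaultdict

# Transactions of Table 1 in this article: id -> items.
DB = {1: set("BCDE"), 2: set("ABDE"), 3: set("BCDE"),
      4: set("ABCDE"), 5: set("DE"), 6: set("C")}
ITEMS = sorted(set().union(*DB.values()))


def psi_item(item):
    """Ids of the transactions that contain a single item."""
    return frozenset(tid for tid, items in DB.items() if item in items)


def basic_itemsets():
    """Map each support value i to the set of basic itemsets for i."""
    # Items with identical transaction sets form one class X_j^(i).
    classes = defaultdict(set)
    for item in ITEMS:
        classes[psi_item(item)].add(item)

    result = defaultdict(set)
    for tids in classes:
        # The basic itemset collects every item occurring in all
        # transactions of the class.
        basic = frozenset(i for i in ITEMS if tids <= psi_item(i))
        result[len(tids)].add(basic)
    return dict(result)
```

On this database the classes with support 4 are `{B}` and `{C}`, giving the basic itemsets `{B, D, E}` and `{C}`, and the class with support 5 is `{D, E}`, which is its own basic itemset.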

There is an equivalent description of basic itemsets and sim-basic itemsets as follows.

Proposition 2

  1. $X$ is a basic itemset if and only if $X$ is a closed itemset and there exists $I_0 \in \mathcal{I}$ such that $\Psi(\{I_0\}) = \Psi(X)$.

  2. $X$ is a sim-basic itemset if and only if $X$ is a closed itemset and $\Psi(\{I\}) \neq \Psi(X)$ for any $I \in \mathcal{I}$.

Proof

(1) Suppose $X$ is a basic itemset. By Definition 3, there exists $I_0 \in \mathcal{I}$ such that $\Psi(\{I_0\}) = \Psi(X)$. For $I \notin X$, $\Psi(X) \nsubseteq \Psi(\{I\})$ according to Definition 3. Hence, $\Psi(\{I\} \cup X) \neq \Psi(X)$, implying that $X$ is a closed itemset.

Conversely, suppose $X$ is a closed itemset and there exists $I_0 \in \mathcal{I}$ such that $\Psi(\{I_0\}) = \Psi(X)$. According to the definition of closed itemsets, $I_0 \in X$. Let $X'$ be the basic itemset $\{I \mid \Psi(\{I_0\}) \subseteq \Psi(\{I\})\}$. Then $X'$ is a closed itemset. Since $X$ and $X'$ are both closed itemsets and $\Psi(X) = \Psi(X') = \Psi(\{I_0\})$, $X$ is identical to $X'$, implying that $X$ is a basic itemset.

(2) Suppose $X$ is a sim-basic itemset whose support is $i$ ($1 \leq i \leq n$). Assume that $X$ is not a closed itemset, then there exists an itemset $Y_0$ such that $X \subset Y_0$ and $\Psi(Y_0) = \Psi(X)$. Hence, there exists an item $I_0$ such that $I_0 \in Y_0 \setminus X$. Let $\tilde{X}_0$ be the basic itemset $\{I \mid \Psi(\{I_0\}) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$. It is clear that $\tilde{X}_0 \nsubseteq X$ since $I_0 \notin X$, but $\Psi(X) = \Psi(Y_0) \subseteq \Psi(\{I_0\}) = \Psi(\tilde{X}_0)$, contradicting Definition 3. Therefore, $X$ is a closed itemset. Now we assume that there exists $I \in \mathcal{I}$, such that $\Psi(\{I\}) = \Psi(X)$. Then $I \in X$, since $X$ is a closed itemset. Therefore, there exists a basic itemset $\tilde{X}^{(k)}_j$ with $1 \leq j \leq k_s$, $i+1 \leq k \leq n$, such that $I \in \tilde{X}^{(k)}_j$, noting that $X$ is in the sub-semigroup generated by $\{\tilde{X}^{(k)}_j \mid 1 \leq j \leq k_s, i+1 \leq k \leq n\}$ by the definition of sim-basic itemset. This contradicts the fact that $|\Psi(\{I\})| = |\Psi(X)| = i$.

Conversely, suppose $X = \{I_1, I_2, \ldots, I_q\}$ is a closed itemset whose support is $i$ and $\Psi(\{I\}) \neq \Psi(X)$ for any $I \in \mathcal{I}$. Then $|\Psi(\{I_u\})| > |\Psi(X)| = i$ for $1 \leq u \leq q$. Let $X'$ be $\tilde{X}_1 \cup \tilde{X}_2 \cup \cdots \cup \tilde{X}_q$, where $\tilde{X}_u$ is the basic itemset $\{I \mid \Psi(\{I_u\}) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$. Then $\Psi(X') = \Psi(X)$ by Remark 1. Note that $X$ is a closed itemset. Hence, $X' = X$. Besides, $X \in P$ since $|\Psi(\tilde{X}_u)| = |\Psi(\{I_u\})| > i$, where $P$ is the sub-semigroup of $(\mathcal{X}, \cup)$, generated by $\{\tilde{X}^{(k)}_j \mid 1 \leq j \leq k_s, i+1 \leq k \leq n\}$. Therefore, to prove that $X$ is a sim-basic itemset, it suffices to prove that for any basic itemset $\tilde{X} \nsubseteq X$, $\Psi(X) \nsubseteq \Psi(\tilde{X})$. Assume that there exists $\tilde{X}_0 \nsubseteq X$, such that $\Psi(X) \subseteq \Psi(\tilde{X}_0)$. Since there exists $I_0 \in \tilde{X}_0$ such that $\Psi(\{I_0\}) = \Psi(\tilde{X}_0)$ by the definition of basic itemset, $I_0 \notin X$ and $\Psi(X) \subseteq \Psi(\{I_0\})$. Hence, $\Psi(X \cup \{I_0\}) = \Psi(X)$, contradicting the fact that $X = X'$ is a closed itemset.□
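The characterization in Proposition 2 suggests a simple way to sort closed itemsets into the two kinds. The sketch below assumes the reading "closed, and some single item has the same transaction set" for basic itemsets and "closed, and no single item has the same transaction set" for sim-basic itemsets; it only looks at non-empty itemsets with non-empty support, enumerates by brute force, and uses hypothetical helper names on the database listed in Table 1.

```python
from itertools import combinations

# Transactions of Table 1 in this article: id -> items.
DB = {1: set("BCDE"), 2: set("ABDE"), 3: set("BCDE"),
      4: set("ABCDE"), 5: set("DE"), 6: set("C")}
ITEMS = sorted(set().union(*DB.values()))


def psi(itemset):
    """Ids of the transactions containing every item of the itemset."""
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= items)


def closure(itemset):
    """All items occurring in every transaction that contains the itemset."""
    tids = psi(itemset)
    return frozenset(i for i in ITEMS if tids <= psi({i}))


def classify_closed():
    """Split the closed itemsets into (basic, sim_basic)."""
    item_tidsets = {psi({i}) for i in ITEMS}
    basic, sim_basic = set(), set()
    for r in range(1, len(ITEMS) + 1):
        for c in combinations(ITEMS, r):
            x = frozenset(c)
            if not psi(x) or closure(x) != x:
                continue  # unsupported or not closed
            if psi(x) in item_tidsets:
                basic.add(x)       # some single item has the same Psi
            else:
                sim_basic.add(x)   # no single item has the same Psi
    return basic, sim_basic
```

Here `{B, C, D, E}` is closed with transaction set `{1, 3, 4}`, which is not the transaction set of any single item, so it falls on the sim-basic side.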

The following is a preparation for Theorem 3.

Lemma 3

Let $S_i$, $\mathcal{B}_i$, and $MFI_{i+1}$ be the set of sim-basic itemsets for $i$, the set of basic itemsets for $i$, and the set of maximal frequent itemsets when the minimum support threshold is $i+1$, respectively. Then each element in $\mathcal{B}_i$ cannot be a subset of any element in $S_i \cup MFI_{i+1}$. Each element in $S_i$ cannot be a subset of any element in $\mathcal{B}_i \cup MFI_{i+1}$.

Proof

By Definition 3, for $X_0 \in \mathcal{B}_i$, there exists $I_0$ such that $|\Psi(\{I_0\})| = i$ and $I_0 \in X_0$. And for $X_1 \in S_i \cup MFI_{i+1}$, $I_0 \notin X_1$ since $X_1$ either is in the sub-semigroup generated by $\{\tilde{X}^{(k)}_j \mid 1 \leq j \leq k_s, i+1 \leq k \leq n\}$ by the definition of sim-basic itemset or satisfies $|\Psi(X_1)| > i$. Hence, $X_0$ cannot be a subset of $X_1$.

Suppose $X_0 \in \mathcal{B}_i$, $X_1 \in S_i$, and $X_2 \in MFI_{i+1}$. Then $X_1$ cannot be a subset of $X_2$ since $|\Psi(X_2)| \geq i+1$ and $|\Psi(X_1)| = i$. Since $X_0 \nsubseteq X_1$ as we have proved, $\Psi(X_1) \nsubseteq \Psi(X_0)$ by the definition of sim-basic itemset. Hence, $\Psi(X_1) \neq \Psi(X_0)$. Note that $|\Psi(X_1)| = |\Psi(X_0)| = i$. Therefore, $X_1$ cannot be a subset of $X_0$.□

Based on basic itemsets and sim-basic itemsets, the recurrence formula of maximal frequent itemsets can be deduced as follows.

Theorem 3

Let $MFI_{i+1}$, $MFI_i$ be the sets of maximal frequent itemsets when the minimum support thresholds are $i+1$ and $i$, respectively. Then $MFI_i = \widetilde{MFI}_{i+1} \cup \mathcal{B}_i \cup S_i$, where $S_i$ (the set of sim-basic itemsets for $i$) and $\mathcal{B}_i$ (the set of basic itemsets for $i$) have been defined in Lemma 3, and $\widetilde{MFI}_{i+1}$ can be obtained by deleting the elements of $MFI_{i+1}$ that are subsets of the elements in $S_i \cup \mathcal{B}_i$.

Proof

Suppose $X \in MFI_i$. There are three cases.

Case 1. If $|\Psi(X)| = i$ and $\Psi(\{I\}) \neq \Psi(X)$ for any $I \in \mathcal{I}$, then $X$ is a closed itemset since it is a maximal frequent itemset. Furthermore, $X$ is a sim-basic itemset by Proposition 2, i.e., $X \in S_i$.

Case 2. If $|\Psi(X)| = i$ and there exists $I_0 \in \mathcal{I}$ such that $\Psi(\{I_0\}) = \Psi(X)$, then $X$ is the G-frequent itemset for $\Psi(X)$ by Theorem 2 and Corollary 1, i.e., $X = \{I \mid \Psi(X) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$. Let $\tilde{X}_0$ be the basic itemset $\{I \mid \Psi(\{I_0\}) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$, which is exactly $X$. Hence, $X \in \mathcal{B}_i$.

Case 3. If $|\Psi(X)| > i$, then $X \in \widetilde{MFI}_{i+1}$ for the following reasons.

Assume that $X \notin \widetilde{MFI}_{i+1}$, then there exists a maximal frequent itemset for min sup $= i+1$, denoted by $X_0$, such that $X \subset X_0$. Since $X_0$ is also a frequent itemset for min sup $= i$, $X_0$ is a subset of a maximal frequent itemset for min sup $= i$. Therefore, $X$ is a proper subset of a maximal frequent itemset for min sup $= i$, which contradicts the fact that $X$ is a maximal frequent itemset for min sup $= i$.

Now we claim that $\widetilde{MFI}_{i+1} \cup \mathcal{B}_i \cup S_i$ generates all the frequent itemsets. This can be verified by the fact that each element in $\widetilde{MFI}_{i+1} \cup \mathcal{B}_i \cup S_i$ is a frequent itemset for min sup $= i$, and $MFI_i \subseteq \widetilde{MFI}_{i+1} \cup \mathcal{B}_i \cup S_i$ as we have proved. Besides, $X_1 \nsubseteq X_2$ for distinct $X_1, X_2 \in \widetilde{MFI}_{i+1} \cup \mathcal{B}_i \cup S_i$ by Lemma 3. Therefore, $\widetilde{MFI}_{i+1} \cup \mathcal{B}_i \cup S_i$ is the set of maximal frequent itemsets for min sup $= i$ by Lemma 2.□

Lemma 4

Suppose $X \in MFI_{i+1}$. Then $X$ is a subset of an element in $S_i \cup \mathcal{B}_i$ (the sim-basic and basic itemsets for $i$), if and only if there exists $I_0 \in \mathcal{I}$ such that $|\Psi(X \cup \{I_0\})| = i$.

Proof

Suppose there exists $I_0 \in \mathcal{I}$ such that $|\Psi(X \cup \{I_0\})| = i$. Then $X$ must be a subset of a closed itemset whose support is $i$. Since $S_i \cup \mathcal{B}_i$ is the set of closed itemsets whose support is $i$ according to Proposition 2, $X$ is a subset of an element in $S_i \cup \mathcal{B}_i$.

Suppose $X$ is a subset of an element in $S_i \cup \mathcal{B}_i$. Then there exists an itemset $X_0$ such that $|\Psi(X \cup X_0)| = i$ and $X \cup X_0$ is a closed itemset. Let $I_0$ be an element in $X_0$. Assume that $|\Psi(X \cup \{I_0\})| < i$, then $|\Psi(X_0 \cup X)| \leq |\Psi(X \cup \{I_0\})| < i$, contradicting $|\Psi(X \cup X_0)| = i$. Hence, $|\Psi(X \cup \{I_0\})| \geq i$. Since $X \in MFI_{i+1}$, $|\Psi(X \cup \{I_0\})| < i+1$ by the definition of maximal frequent itemsets. Finally, we obtain that $|\Psi(X \cup \{I_0\})| = i$, completing the proof.□

Definition 4

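Let $\mathcal{B}_i$ and $S_i$ be the sets of basic itemsets and sim-basic itemsets for $i$, respectively. Find the subset of $\mathcal{B}_i$ (or $S_i$) such that for each element $X$ in the subset, there exists $I \in \mathcal{I}$ such that $|\Psi(\{I\} \cup X)| = k$ with $0 \leq k \leq i-1$. Then the subset is called a Re-$k$ subset of $\mathcal{B}_i$ (or $S_i$), and denoted by $\mathcal{B}^k_i$ (or $S^k_i$).

Theorem 3 and Lemma 4 immediately lead to the following theorem.

Theorem 4

Given a minimum support threshold $t$, suppose $t+1 \leq i \leq n$. For $k = t, t+1, \ldots, i-1$, delete the Re-$k$ subset from $\mathcal{B}_i$ (or $S_i$), and obtain a new set denoted by $\tilde{\mathcal{B}}_i$ (or $\tilde{S}_i$). Then $MFI_t = \bigcup_{i=t+1}^{n} (\tilde{\mathcal{B}}_i \cup \tilde{S}_i) \cup \mathcal{B}_t \cup S_t$.

Proof

$MFI_n = \mathcal{B}_n$ and $S_n = \emptyset$. According to Lemma 4 and Theorem 3, $\widetilde{MFI}_n$ can be obtained by deleting the Re-$(n-1)$ subset of $\mathcal{B}_n$, and $MFI_{n-1} = (\mathcal{B}_n \setminus \mathcal{B}^{n-1}_n) \cup \mathcal{B}_{n-1} \cup S_{n-1}$. Similarly, $MFI_{n-2} = (\mathcal{B}_n \setminus \mathcal{B}^{n-1}_n \setminus \mathcal{B}^{n-2}_n) \cup (\mathcal{B}_{n-1} \setminus \mathcal{B}^{n-2}_{n-1}) \cup (S_{n-1} \setminus S^{n-2}_{n-1}) \cup \mathcal{B}_{n-2} \cup S_{n-2}$. Generally,

$MFI_s = \bigcup_{j=s+1}^{n} \left[ (\mathcal{B}_j \setminus \mathcal{B}^{j-1}_j \setminus \cdots \setminus \mathcal{B}^{s}_j) \cup (S_j \setminus S^{j-1}_j \setminus \cdots \setminus S^{s}_j) \right] \cup \mathcal{B}_s \cup S_s$,

where $0 \leq s \leq n-1$, completing the proof.□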

Proposition 3

Let $i_s$ ($0 \leq i \leq n$) be the number of disjoint subsets of the set $\mathcal{I}^{(i)} = \{I \mid I \in \mathcal{I}, |\Psi(\{I\})| = i\}$ as defined in Definition 3.

  1. $0_s + 1_s + 2_s + \cdots + n_s \leq m$.

  2. $\left|\bigcup_{i=p}^{n} S_i\right| \leq |P| \leq 2^{(p+1)_s + (p+2)_s + \cdots + n_s} - 1$, where $P$ is the sub-semigroup of $(\mathcal{X}, \cup)$, generated by $\{\tilde{X}^{(k)}_j \mid 1 \leq j \leq k_s, p+1 \leq k \leq n\}$.

Proof

(1) Straightforward by the definitions of basic itemsets.

(2) By Definition 3, $\bigcup_{i=p}^{n} S_i \subseteq P$, and, writing $N = (p+1)_s + (p+2)_s + \cdots + n_s$, $|P|$ is at most

$C^{1}_{N} + C^{2}_{N} + \cdots + C^{N}_{N} = 2^{N} - 1$,

since for any basic itemset $\tilde{X}^{(k)}_j$, the product of an arbitrary number of $\tilde{X}^{(k)}_j$'s is still $\tilde{X}^{(k)}_j$.□

By Theorem 3, we directly have the following algorithm for discovering maximal frequent itemsets.

Algorithm 1

Discovery of maximal frequent itemsets.

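  Input: $MFI_n = \{I \mid \Psi(\{I\}) = T, I \in \mathcal{I}\}$, and the minimum support threshold $t$.

  Output: $MFI_t$.

  Step 1. For $t \leq i \leq n$, find the items from $\mathcal{I}$ such that the support of each item is $i$. Denote the set of these items by $\mathcal{I}^{(i)}$.

  Step 2. Divide $\mathcal{I}^{(i)}$ into disjoint subsets $X^{(i)}_1, \ldots, X^{(i)}_{i_s}$ such that for any two elements $I_1, I_2 \in \mathcal{I}^{(i)}$, $\Psi(\{I_1\}) = \Psi(\{I_2\})$ if and only if $I_1, I_2$ are in the same subset.

  Step 3. For $X^{(i)}_j$ ($1 \leq j \leq i_s$, $t \leq i \leq n$), find the subset $\{I \mid \Psi(X^{(i)}_j) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$. Denote it by $\tilde{X}^{(i)}_j$.

  Step 4. Discovery of sim-basic itemsets.

    Step 4.1. Let $\widetilde{\mathcal{X}}$ be the set $\{\tilde{X}^{(i)}_j \mid i = t+1, \ldots, n,\ j = 1, 2, \ldots, i_s\}$, and $i_0 = t+1$.

    Step 4.2. For $\tilde{X}^{(i_0)}_j \in \widetilde{\mathcal{X}}$, find the subset $\{\tilde{X}^{(k)}_{j'} \mid \Psi(\tilde{X}^{(i_0)}_j) \subseteq \Psi(\tilde{X}^{(k)}_{j'}),\ \tilde{X}^{(k)}_{j'} \in \widetilde{\mathcal{X}}\}$, denoted by $H^{(i_0)}_j$.

    Step 4.3. Let $\widetilde{\mathcal{X}} = \widetilde{\mathcal{X}} \setminus \bigcup_{j=1}^{i_s} H^{(i_0)}_j$, and $i_0 = i_0 + 1$. If $\widetilde{\mathcal{X}} \neq \emptyset$, return to Step 4.2.

    Step 4.4. Let $P^{(i_0)}_j$ be the product of elements in $H^{(i_0)}_j$ under $\cup$, and $P$ be the sub-semigroup of $(\mathcal{X}, \cup)$, generated by $\{P^{(i)}_j \mid 1 \leq j \leq i_s, t+1 \leq i \leq n\}$.

    Step 4.5. Denote $P \setminus \{P^{(i)}_j \mid 1 \leq j \leq i_s, t+1 \leq i \leq n\}$ by $\bar{P}$. Divide $\bar{P}$ into disjoint subsets $\{P_1, P_2, \ldots, P_u\}$ such that for $A, B \in \bar{P}$, $A$ and $B$ are in the same subset if and only if $\Psi(A) = \Psi(B)$. For $P_i$, let $X_i$ be the product of all the elements in $P_i$ under $\cup$. Then let $\mathcal{R}$ be $\{X_1, \ldots, X_u\}$.

  Step 5. Computation of $MFI_t$.

    Step 5.1. For $t \leq i \leq n$, let $\mathcal{B}_i$ be $\{\tilde{X}^{(i)}_j \mid 1 \leq j \leq i_s\}$, and let $S_i$ be $\{X_j \mid |\Psi(X_j)| = i, X_j \in \mathcal{R}\}$.

    Step 5.2. For $t+1 \leq i \leq n$ and $X \in \mathcal{B}_i$ (or $S_i$), if there exists $I \in \mathcal{I}$ such that $|\Psi(\{I\} \cup X)| = k$ with $t \leq k \leq i-1$, then delete $X$ from $\mathcal{B}_i$ (or $S_i$) to obtain a new set denoted by $\tilde{\mathcal{B}}_i$ (or $\tilde{S}_i$).

    Step 5.3. Output $MFI_t = \bigcup_{i=t+1}^{n} (\tilde{\mathcal{B}}_i \cup \tilde{S}_i) \cup \mathcal{B}_t \cup S_t$.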

Proposition 4

Let $P^{(i_1)}_{j_1} \cup P^{(i_2)}_{j_2} \cup \cdots \cup P^{(i_q)}_{j_q}$ be an element in $P$ with $q \geq 2$. Then for any $I \in \mathcal{I}$, we have $\Psi(P^{(i_1)}_{j_1} \cup P^{(i_2)}_{j_2} \cup \cdots \cup P^{(i_q)}_{j_q}) \neq \Psi(\{I\})$, where $P^{(i_1)}_{j_1}$ is the product of elements in $H^{(i_1)}_{j_1}$ under $\cup$, and $P$ is the semigroup generated by $\{P^{(i)}_j \mid 1 \leq j \leq i_s, t+1 \leq i \leq n\}$, as defined in Step 4 of Algorithm 1.

Proof

Steps 4.1–4.3 imply that for H j ( i ) , there does not exist I such that Ψ ( I ) Ψ ( P j ( i ) ) , where P j ( i ) is the product of elements in H j ( i ) under . Therefore, for any two different sets H j ( i ) and H j 0 ( i 0 ) , Ψ ( P j ( i ) ) Ψ ( P j 0 ( i 0 ) ) and Ψ ( P j 0 ( i 0 ) ) Ψ ( P j ( i ) ) . Consequently, there does not exist I such that Ψ ( I ) = Ψ ( P j 1 ( i 1 ) P j 2 ( i 1 ) P j q ( i q ) ) , completing the proof.□

Remark 2

  1. Steps 1–3 are to find basic itemsets.

  2. According to Proposition 4, Step 4 finds closed itemsets which meet the condition in Proposition 2; hence, Step 4 can discover sim-basic itemsets under $\cup$.

Example 1

Suppose a database has six transactions and five items, as shown in Table 1. Let the minimum support threshold $t$ be 3. Then $MFI_3$ can be obtained by Algorithm 1.

  1. For $3 \leq i \leq 6$, find the set of items whose support is $i$, and denote it by $\mathcal{I}^{(i)}$. We have $\mathcal{I}^{(4)} = \{B, C\}$, and $\mathcal{I}^{(5)} = \{D, E\}$.

  2. Divide $\mathcal{I}^{(4)}$ into disjoint subsets $X^{(4)}_1 = \{B\}$, $X^{(4)}_2 = \{C\}$. Similarly, $X^{(5)}_1 = \mathcal{I}^{(5)} = \{D, E\}$.

  3. For $X^{(i)}_j$ ($1 \leq j \leq i_s$, $t \leq i \leq n$), find the subset $\tilde{X}^{(i)}_j = \{I \mid \Psi(X^{(i)}_j) \subseteq \Psi(\{I\}), I \in \mathcal{I}\}$. We have $\tilde{X}^{(4)}_1 = \{B, D, E\}$, $\tilde{X}^{(4)}_2 = \{C\}$, and $\tilde{X}^{(5)}_1 = \{D, E\}$.

  4. $H^{(4)}_1 = \{\tilde{X}^{(4)}_1, \tilde{X}^{(5)}_1\}$, $H^{(4)}_2 = \{\tilde{X}^{(4)}_2, \tilde{X}^{(5)}_1\}$, and $H^{(5)}_1 = \{\tilde{X}^{(5)}_1\}$. Then $P^{(4)}_1 = \tilde{X}^{(4)}_1 \cup \tilde{X}^{(5)}_1$, $P^{(4)}_2 = \tilde{X}^{(4)}_2 \cup \tilde{X}^{(5)}_1$, and $P^{(5)}_1 = \tilde{X}^{(5)}_1$.

  5. $\bar{P} = \mathcal{R} = \{\tilde{X}^{(4)}_1 \cup \tilde{X}^{(4)}_2 \cup \tilde{X}^{(5)}_1\}$.

  6. From Steps 4 and 5, $\mathcal{B}_3 = \emptyset$, $S_3 = \{\tilde{X}^{(4)}_1 \cup \tilde{X}^{(4)}_2 \cup \tilde{X}^{(5)}_1\}$, $\mathcal{B}_4 = \{\tilde{X}^{(4)}_1, \tilde{X}^{(4)}_2\}$, $S_4 = \emptyset$, $\mathcal{B}_5 = \{\tilde{X}^{(5)}_1\}$, $S_5 = \emptyset$.

  7. Since $|\Psi(\tilde{X}^{(4)}_1 \cup \{C\})| = 3$ and $|\Psi(\tilde{X}^{(4)}_2 \cup \{B\})| = 3$, $\tilde{\mathcal{B}}_4 = \emptyset$. And $\tilde{\mathcal{B}}_5 = \emptyset$ since $|\Psi(\tilde{X}^{(5)}_1 \cup \{B\})| = 4$.

  8. Output $MFI_3 = S_3 = \{\tilde{X}^{(4)}_1 \cup \tilde{X}^{(4)}_2 \cup \tilde{X}^{(5)}_1\} = \{\{B, C, D, E\}\}$, since $\mathcal{B}_3 = \emptyset$, and $\tilde{\mathcal{B}}_i = \tilde{S}_i = \emptyset$ for $i = 4, 5$.

Table 1

Sample database

Transactions Items
1 BCDE
2 ABDE
3 BCDE
4 ABCDE
5 DE
6 C
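The output of Example 1 can be cross-checked with a short sketch. It is not Algorithm 1 itself: following the observation in the proof of Lemma 4 that the basic and sim-basic itemsets for $i$ together are the closed itemsets with support $i$, it enumerates closed itemsets by brute force and then applies the deletion rule of Theorem 4 (drop an itemset if adding a single item still leaves support at least $t$). All function names are hypothetical.

```python
from itertools import combinations

# Transactions of Table 1: id -> items.
DB = {1: set("BCDE"), 2: set("ABDE"), 3: set("BCDE"),
      4: set("ABCDE"), 5: set("DE"), 6: set("C")}
ITEMS = sorted(set().union(*DB.values()))


def psi(itemset):
    """Ids of the transactions containing every item of the itemset."""
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= items)


def closed_by_support():
    """Map support i to the closed non-empty itemsets with support i."""
    out = {}
    for r in range(1, len(ITEMS) + 1):
        for c in combinations(ITEMS, r):
            x, tids = frozenset(c), psi(c)
            closed = all(len(psi(x | {i})) < len(tids)
                         for i in ITEMS if i not in x)
            if tids and closed:
                out.setdefault(len(tids), set()).add(x)
    return out


def mfi_by_recurrence(t):
    """Closed itemsets of support >= t with no one-item extension of
    support >= t (the survivors of the Re-k deletions)."""
    result = set()
    for i, itemsets in closed_by_support().items():
        if i < t:
            continue
        for x in itemsets:
            if not any(len(psi(x | {item})) >= t
                       for item in ITEMS if item not in x):
                result.add(x)
    return result


def mfi_brute_force(t):
    """Reference answer straight from the definition."""
    freq = [frozenset(c) for r in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, r) if len(psi(c)) >= t]
    return {x for x in freq if not any(x < y for y in freq)}
```

For $t = 3$ both functions return the single itemset `{B, C, D, E}`, matching the result stated in Example 1.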

Lemma 5

Suppose $A$, $B$ are two sorted sets in ascending order, and $|A| = p$, $|B| = q$. Then the complexity of computing $A \cap B$ is at most $p + q$.

Proof

$A \cap B$ can be obtained by carrying out the following steps.

Step 1. Let $C = \emptyset$.

Step 2. If $A \neq \emptyset$ and $B \neq \emptyset$, carry out Step 3. Otherwise, output $C$.

Step 3. Suppose $a_1$ and $b_1$ are the first elements of $A$ and $B$, respectively. If $a_1 < b_1$, then delete $a_1$ from $A$ to obtain a new set, still denoted by $A$, and return to Step 2. If $a_1 = b_1$, then add $a_1$ to $C$, delete $a_1$ and $b_1$ from $A$ and $B$, respectively, to obtain new sets, still denoted by $A$ and $B$, and return to Step 2. If $a_1 > b_1$, then delete $b_1$ from $B$ to obtain a new set, still denoted by $B$, and return to Step 2.

Since at least one element is deleted from $A \cup B$ in each Step 3, the complexity of computing $A \cap B$ is at most $p + q$.□
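The three steps in this proof correspond to the usual two-pointer merge. A minimal Python sketch follows; the function name and the returned step counter are illustrative additions, and index pointers are used instead of physically deleting elements.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists with a two-pointer merge.

    Returns (intersection, steps), where `steps` counts executions of
    the comparison step, so that steps <= len(a) + len(b).
    """
    i = j = steps = 0
    out = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] < b[j]:
            i += 1            # "delete" the first element of A
        elif a[i] > b[j]:
            j += 1            # "delete" the first element of B
        else:
            out.append(a[i])  # common element: keep it, advance both
            i += 1
            j += 1
    return out, steps
```

For example, intersecting the transaction lists `[1, 2, 3, 4]` and `[1, 3, 4, 6]` yields `[1, 3, 4]` after four comparison steps, well within the bound $p + q = 8$.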

Proposition 5

The complexity of discovering maximal frequent itemsets for min sup $= t$ is less than $O(3mn2^{\beta} + 4^{\beta}n)$, where $\beta$ is $(t+1)_s + \cdots + n_s$, and $i_s$ ($t+1 \leq i \leq n$) is the number of disjoint subsets of the set $\mathcal{I}^{(i)} = \{I \mid I \in \mathcal{I}, |\Psi(\{I\})| = i\}$ as defined in Definition 3.

Proof

We define an order on $T$ as $t_i < t_j$ if and only if $i < j$.

First of all, for $I \in \mathcal{I}$, $\Psi(\{I\})$ can be obtained as a sorted set in ascending order by comparing $I$ with the elements in $t_i \in T$ from $i = 1$ to $i = n$; hence, the complexity of computing $\{\Psi(\{I\}) \mid I \in \mathcal{I}\}$ is $m^2 n$.

Steps 1 and 2 of Algorithm 1 divide $\mathcal{I}$ into disjoint subsets by comparing $\Psi(\{I_1\})$ with $\Psi(\{I_2\})$ for every two elements $I_1, I_2$ in $\mathcal{I}$. So the complexity is at most $2nC^2_m$ by Lemma 5.

Step 3 finds $\tilde{X}^{(i)}_j$ also by making comparisons between $\Psi(X^{(i)}_j)$ and $\Psi(\{I\})$. And we do not need to compute $\Psi(X^{(i)}_j)$ since $\Psi(X^{(i)}_j) = \Psi(\{I\})$ for any element $I$ in $X^{(i)}_j$. So the complexity is also at most $2nC^2_m$ by Lemma 5. Besides, there is no need to compute $\Psi(\tilde{X}^{(i)}_j)$ in the following Step 4, since $\Psi(\tilde{X}^{(i)}_j) = \Psi(X^{(i)}_j)$ by Remark 1.

In Step 4, Steps 4.1–4.3 make comparisons between every two elements in $\widetilde{\mathcal{X}}$; hence, the complexity is at most $C^2_{\beta}$, where $\beta$ denotes $(t+1)_s + \cdots + n_s$.

The complexity of computing $\{\Psi(A) \mid A \in P\}$ is at most $(2^{\beta} - 1) n \beta$, noting that the element number of $P$ is at most $2^{\beta} - 1$ by Proposition 3 and $\Psi(A)$ is an intersection of some elements in $\{\Psi(\tilde{X}^{(i)}_j) \mid 1 \leq j \leq i_s, t+1 \leq i \leq n\}$ since $\Psi(P^{(i)}_j) = \Psi(\tilde{X}^{(i)}_j)$.

The complexity of Step 4.5 is less than $2nC^2_{2^{\beta}-1}$, since Step 4.5 compares $\Psi(A)$ with $\Psi(B)$ for every two elements $A, B$ in $\bar{P}$.

Step 5 computes $\Psi(\{I\} \cup X) = \Psi(\{I\}) \cap \Psi(X)$, where $I \in \mathcal{I}$, and $X$ is an element in $\mathcal{B}_i \cup S_i$ with $t+1 \leq i \leq n$. Hence, the complexity of Step 5 is at most $\sum_{i=t+1}^{n} (|\mathcal{B}_i| + |S_i|)\, m\, (n + n)$, which is less than $2(\beta + 2^{\beta}) m n$ according to Proposition 3.

Therefore, the total complexity of Algorithm 1 is less than $m^2 n + 4C^2_m n + C^2_{\beta} + 2^{\beta} n \beta + 2C^2_{2^{\beta}-1} n + 2(\beta + 2^{\beta}) m n$, or $O(3mn2^{\beta} + 4^{\beta}n)$.□

Remark 3

Eppstein [8] gave an $O(l^3 4^l (m+n))$ algorithm to generate all maximal frequent itemsets by graph theory, where $l$ denotes the maximum of $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$, with the maximum taken over all maximal frequent itemsets $C$. Since $n, t \gg m$ in practical applications, $|\Psi(C)| \gg |C|$. Hence, $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$ is nearly $m$ in the worst case. The complexity of Algorithm 1 in the worst case is $O(3mn2^{m} + 4^{m}n)$, which is much lower than $O(m^3 4^m (m+n))$.
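To get a feel for the gap between the two worst-case expressions in Remark 3, they can simply be evaluated for sample sizes. The sketch below assumes the expressions read $3mn2^m + 4^m n$ and $m^3 4^m (m+n)$; it compares the formulas only, ignores the constants hidden by the $O$-notation, and says nothing about measured running time. The function names are hypothetical.

```python
def bound_algorithm1(m, n):
    """Worst-case expression 3*m*n*2^m + 4^m*n (beta taken as m)."""
    return 3 * m * n * 2**m + 4**m * n


def bound_graph_method(m, n):
    """Expression m^3 * 4^m * (m + n), i.e. the bound from [8] with l near m."""
    return m**3 * 4**m * (m + n)


def ratio(m, n):
    """How many times larger the second expression is than the first."""
    return bound_graph_method(m, n) / bound_algorithm1(m, n)
```

For instance, with $m = 10$ items and $n = 1000$ transactions the two expressions evaluate to 1,079,296,000 and 1,059,061,760,000, a ratio of roughly $10^3$, which is on the order of the factor $m^3$.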

6 Experiments

Besides the theoretical analysis, we use Python 3.5 to evaluate the performance of Algorithm 1 on four different datasets, Chess, Mushroom, T40I10D100K, and T10I4D100K from UCI, that have been commonly used in previous research. All experiments are performed on a computer running the Microsoft Windows 7 operating system, with a Core i7 CPU and 8 GB of RAM. Our algorithm is compared with three well-known deterministic algorithms for frequent itemset mining: ECLAT [20], Apriori [21], and FP-Growth [22].

Since we mainly considered complexities of algorithms in this article, we use the execution time as the performance measure. We examined each dataset with various minimum support thresholds and saved the execution time of each algorithm.

The results in Figure 1 indicate that Algorithm 1 has a shorter execution time than the other approaches.

Figure 1

Execution time vs minimum support (a) using Chess dataset, (b) using Mushroom dataset, (c) using T40I10D100K dataset, and (d) using T10I4D100K dataset.

7 Conclusion

We have presented methods to discover maximal frequent itemsets from the perspective of semigroup algebra and proved that the complexity of the methods is much lower than the known complexity in the worst case. Experiments on four commonly used datasets also show that the algorithm based on our method performs better than three other well-known algorithms. Meanwhile, we provided explicit forms of simplified generators of frequent itemsets, proved that the simplified generators are maximal frequent itemsets and vice versa, provided a necessary and sufficient condition for a maximal $(i+1)$-frequent itemset being a subset of a closed $i$-frequent itemset, and put forward a recurrence formula of maximal frequent itemsets by defining basic itemsets and sim-basic itemsets.

We also explored some algebraic properties of rule mining, which can be used to investigate other basic problems such as more efficient algorithms for discovering closed frequent itemsets, generators of association rules and reducing redundant association rules in further work.

Acknowledgments

The authors sincerely thank the referees for their constructive comments and fruitful s