An algebraic semigroup method for discovering maximal frequent itemsets

Abstract: Discovering maximal frequent itemsets is an important issue and a key technique in many data mining problems such as association rule mining. In the literature, generating maximal frequent itemsets has been shown either to be NP-hard or to have $O(l^{3}4^{l}(m+n))$ complexity in the worst case from the perspective of generating maximal complete bipartite graphs of a bipartite graph, where $m$, $n$ are the item number and the transaction number, respectively, and $l$ denotes the maximum of $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$, with the maximum taken over all maximal frequent itemsets $C$. In this article, we put forward a method for discovering maximal frequent itemsets from the perspective of semigroup algebra, whose worst-case complexity is $O(3mn2^{\beta}+4^{\beta}n)$, lower than the known worst-case complexity, where $\beta$ is the number of items whose support is more than the minimum support threshold. Experiments also show that an algorithm based on the algebraic method performs better than three other well-known algorithms. Meanwhile, we explore some algebraic properties with respect to items and transactions, prove that the maximal frequent itemsets are exactly the simplified generators of frequent itemsets, give a necessary and sufficient condition for a maximal $(i+1)$-frequent itemset being a subset of a closed $i$-frequent itemset, and provide a recurrence formula of maximal frequent itemsets.

Introduction
In 1993, Agrawal et al. [1] introduced the problem of mining a large collection of basket data-type transactions for association rules between sets of items with a minimal confidence threshold and presented an efficient algorithm for this purpose. Association rule mining has since been a research focus and has become a key technique in data mining, with broad and successful applications in areas such as market analysis, financial investment, health, environmental protection, and product manufacturing, yielding significant social and economic benefits.
A key component in the association rule mining problem (e.g., [2][3][4]) and in other data mining problems such as episodes [5] and minimal keys [6] is to discover frequent itemsets and maximal frequent itemsets. Maximal frequent itemsets (MFIs) can be used to improve the performance of discovering frequent itemsets. Furthermore, it suffices to know only the maximal frequent itemsets in many data mining applications, such as minimal key discovery and theory extraction [6].
Most of the research on discovering maximal frequent itemsets has focused on developing deterministic algorithms (e.g., Bayardo [7]; Eppstein [8]; Lin and Kedem [9]; Boros et al. [10]; Dhabu and Deshpande [11]; Kabir et al. [12]; Karim et al. [13]; Halim et al. [14]), on complexity analysis (Eppstein [8]; Boros et al. [10]), on developing approximation algorithms (e.g., Fatemi et al. [15]; Zhang et al. [16]), or on mining maximal frequent itemsets with added constraints such as maximal diverse frequent itemsets (e.g., Wu et al. [17]). Boros et al. [10] claimed that maximal $t$-frequent sets cannot be generated efficiently unless $P = NP$. Eppstein [8] turned the problem of discovering maximal frequent itemsets into finding all maximal complete bipartite graphs of a bipartite graph on $m+n$ vertices and gave an algorithm with complexity $O(l^{3}4^{l}(m+n))$ in the worst case, where $m$, $n$ are the item number and the transaction number, respectively, and $l$ denotes the maximum of $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$, with the maximum taken over all maximal frequent itemsets $C$.
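To make the parameter $l$ concrete, the following is a minimal Python sketch that evaluates the ratio $|C||\Psi(C)|/(|C|+|\Psi(C)|-1)$ stated in the abstract and takes its maximum. The function name `eppstein_l` and the input pairs are hypothetical and serve only as an illustration.

```python
def eppstein_l(sizes):
    """Maximum of |C| * |Psi(C)| / (|C| + |Psi(C)| - 1).

    `sizes` is an iterable of (|C|, |Psi(C)|) pairs, one pair per
    maximal frequent itemset C (hypothetical input format).
    """
    return max(c * s / (c + s - 1) for c, s in sizes)


# Hypothetical sizes: one itemset with 2 items and support 2,
# another with 3 items and support 4.
l_value = eppstein_l([(2, 2), (3, 4)])
print(l_value)
```

Since the bound $O(l^{3}4^{l}(m+n))$ is exponential in $l$, a single large and well-supported maximal frequent itemset dominates its value.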
In this article, we put forward a semigroup algebraic method for mining maximal frequent itemsets, whose worst-case complexity is $O(3mn2^{\beta}+4^{\beta}n)$ (in practical applications), lower than the known worst-case complexity $O(l^{3}4^{l}(m+n))$ given by Eppstein [8], where $\beta$ is the number of items whose support is more than the minimum support threshold.
First, we use algebraic language to describe the natural algebraic structure of items and transactions and explore some algebraic properties. For instance, there is a semigroup homomorphism between itemsets and transactions.
Under the algebraic framework, we give the explicit forms of simplified generators of frequent itemsets and prove that the simplified generators are maximal frequent itemsets and vice versa. Then we define basic itemsets and sim-basic itemsets and use them to construct the recurrence formula of maximal frequent itemsets. Finally, the recurrence formula is used to provide methods for discovering maximal frequent itemsets whose worst-case complexity proves to be much lower than the known worst-case complexity. Experimental comparisons also indicate a better performance of an algorithm based on the method.

Preliminaries
In this section, we briefly review some basic notation for algebraic semigroups and frequent itemset mining. Interested readers can refer to the relevant literature (e.g., Agrawal et al. [1]; Clifford and Preston [18]) for more details, or for more information on the approaches to frequent itemset mining (e.g., Luna [19]).
Let $\mathcal{I}$ be a set of items, and $\mathcal{T}$ be a set of transactions, where each transaction has a unique identifier and contains a set of items (also called an itemset). The support of an itemset $X$, denoted by $\mathrm{supp}(X)$, is the number of transactions in which it occurs as a subset. An itemset is frequent if its support is not less than a pre-specified minimum support threshold $\mathit{min\_sup}$. $X$ is a maximal frequent itemset if there is no other itemset $X'$ such that $X'$ is frequent and $X \subset X'$.
A binary operation on a set $S$ is a mapping of $S \times S$ into $S$; the image of the element $(a, b)$ of $S \times S$ will be denoted by $a \cdot b$. Frequently, we shall omit the dot, writing $ab$ for $a \cdot b$. Other symbols which may be used to denote binary operations are $+$, $\circ$, $*$. A semigroup is a set $S$ together with a binary operation $(\cdot)$ such that the operation is associative, i.e., $(a \cdot b) \cdot c = a \cdot (b \cdot c)$ for all $a, b, c \in S$. Denote the semigroup by $S(\cdot)$, or $S$ for simplicity when there is no danger of ambiguity, and say that $S$ is a semigroup with respect to $\cdot$. If $a \cdot b = b \cdot a$ for all $a, b \in S$, then $S$ is called a commutative semigroup. If $e \in S$ and, for all $a \in S$, $a \cdot e = a = e \cdot a$, then $e$ is an identity element. A subset $T$ of $S$ is called a sub-semigroup of $S$ if $a \in T$ and $b \in T$ imply $ab \in T$. Suppose $T$ is a non-empty subset of $S$. The sub-semigroup $\langle T \rangle$ is the set of all elements of $S$ expressible as finite products of elements of $T$. $T$ is called a set of generators of $\langle T \rangle$.
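The itemset definitions above (support, frequent itemset, maximal frequent itemset) can be illustrated by a brute-force Python sketch. It is exponential in the number of items and is meant only to make the definitions concrete; it is not the method of this article. The function names and the toy database are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain `itemset` as a subset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def maximal_frequent_itemsets(transactions, min_sup):
    """Enumerate all itemsets, keep the frequent ones, then keep
    those that have no frequent proper superset."""
    universe = sorted(set().union(*transactions))
    frequent = [
        frozenset(c)
        for k in range(1, len(universe) + 1)
        for c in combinations(universe, k)
        if support(c, transactions) >= min_sup
    ]
    return [x for x in frequent if not any(x < y for y in frequent)]


# Hypothetical toy database (not Table 1 of this article).
toy_db = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BCD")]
mfis = maximal_frequent_itemsets(toy_db, min_sup=2)
print(sorted("".join(sorted(x)) for x in mfis))
```

With a minimum support threshold of 2, every pair among A, B, C is frequent while {A, B, C} has support 1, so the pairs are the maximal frequent itemsets of this toy database.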

Algebraic properties
In this section, we investigate the algebraic semigroup structure and some basic algebraic properties for items and transactions.
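In what follows, we always assume that the number of elements of the item set $\mathcal{I}$ is $m$, and the number of elements of the transaction set $\mathcal{T}$ is $n$. We use $2^{\mathcal{I}}$ to denote the power set of $\mathcal{I}$, i.e., the set of all subsets of $\mathcal{I}$, and we denote the power set of $\mathcal{T}$ by $2^{\mathcal{T}}$.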
For any two elements $X_1, X_2 \in 2^{\mathcal{I}}$, … From (i) and (iv), we directly have … Similarly, $(2^{\mathcal{T}}, \circ)$ and $(2^{\mathcal{T}}, *)$ are also commutative monoids, which take $\emptyset$ and $\mathcal{T}$ as their identity elements.
The following mapping describes the support of an itemset X.
Definition 1. We define a mapping $\Psi: 2^{\mathcal{I}} \to 2^{\mathcal{T}}$ as follows: For $X \in 2^{\mathcal{I}}$ and $X \neq \emptyset$, first find all the transactions $t_1, t_2, \ldots, t_s$ …
In fact, the mapping described in Definition 1 is a homomorphism.
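A mapping from itemsets to sets of transactions can be sketched in Python as follows. The sketch rests on two assumptions that are not legible in the text here: that $\Psi(X)$ collects the identifiers of the transactions containing $X$, and that the operations paired by the homomorphism are union on itemsets and intersection on transaction sets. The function name `psi` and the toy database are hypothetical.

```python
def psi(itemset, transactions):
    """Identifiers of the transactions containing every item of `itemset`
    (assumed reading of the mapping Psi)."""
    items = frozenset(itemset)
    return frozenset(tid for tid, t in transactions.items() if items <= t)


# Hypothetical toy database keyed by transaction identifier.
toy_db = {
    1: frozenset("ABC"),
    2: frozenset("AB"),
    3: frozenset("AC"),
    4: frozenset("BCD"),
}

x1, x2 = frozenset("A"), frozenset("B")
lhs = psi(x1 | x2, toy_db)               # image of the combined itemset
rhs = psi(x1, toy_db) & psi(x2, toy_db)  # combination of the two images
print(sorted(lhs), sorted(rhs))
```

Under these assumptions the two sides agree, and the support of an itemset is the size of its image, `len(psi(X, toy_db))`.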
Theorem 1. (Homomorphism between itemsets and transactions) Ψ is a homomorphism, i.e., if X X , 2 then remove i from Γ. Finally, we obtain the set Γ′. Then (1) In other words, can generate all the frequent itemsets.
∈ ′ the set of simplified generators for min sup .
Lemma 2. Suppose a family of itemsets can generate all the frequent itemsets, and $A \nsubseteq B$ for any two distinct elements $A, B$ of the family. Then the family is the set of maximal frequent itemsets.
Proof. This can be directly verified by the definition of maximal frequent itemset. □
The following statement shows that the maximal frequent itemsets are simplified generators.
∈ ′ be the set of simplified generators. Then it is the set of maximal frequent itemsets.
Proof. By Theorem 2, ∈ ′ can generate all the frequent itemsets, and for any j i ≠ , I I , , ∈ ′ is the set of maximal frequent itemsets by Lemma 2.

Maximal frequent itemset discovering and complexity analysis
In this section, we first define basic itemsets and sim-basic itemsets and give a necessary and sufficient condition for a maximal $(i+1)$-frequent itemset being a subset of a closed $i$-frequent itemset, and then provide a recurrence formula for maximal frequent itemsets. Finally, an algorithm for discovering maximal frequent itemsets is presented, based on the recurrence formula, and its worst-case complexity proves to be smaller than the known worst-case complexity given in [8].
Given a minimal support threshold t, a maximal frequent itemset whose support is t is either a basic itemset or a sim-basic itemset, which will be proved in Theorem 3. Basic itemsets and sim-basic itemsets are defined as follows.
, such that for any two elements I I , if and only if } ∈ , and call it a basic itemset (for i).

Let
be the sub-semigroup of , , then we call X 0 a sim-basic itemset (for i).
There is an equivalent description of basic itemsets and sim-basic itemsets as follows.

Proposition 2.
(1) $X$ is a basic itemset if and only if $X$ is a closed itemset and there exists $I_0 \in \mathcal{I}$ such that …
(2) $X$ is a sim-basic itemset if and only if $X$ is a closed itemset and … for any $I \in \mathcal{I}$.
Proof. (1) Suppose X is a basic itemset. By Definition 3, there exists I 0 ∈ such that implying that X is a closed itemset. Conversely, suppose X is a closed itemset and there exists I 0 ∈ such that According to the definition of closed itemsets, I X 0 ∈ . Let X′ be the basic itemset . Then X′ is a closed itemset. Since X and X′ are both closed itemsets and X X I (2) Suppose X is a sim-basic itemset whose support is i ( i n 1 ≤ ≤ ). Assume that X is not a closed itemset, then there exists an itemset Y 0 such that , contradicting with Definition 3. Therefore, X is a closed itemset. Now we assume that there exists I′ ∈ , such that Then I X ′ ∈ , since X is a closed itemset.
Therefore, there exists a basic itemset X j by the definition of sim-basic itemset. This contradicts with the fact is a closed itemset whose support is i and for any Therefore, to prove that X is a sim-basic itemset, it suffices to prove that for any basic itemset X X by the definition of basic itemset, I X 0 ∉ ′ and The following is a preparation for Theorem 3. ∈ . And for by the definition of sim-basic itemset. Hence, Therefore, X 1 cannot be a subset of X 0 . □ Based on basic itemsets and sim-basic itemsets, the recurrence formula of maximal frequent itemsets can be deduced as follows. Proof. Suppose X MFI i ∈ . There are three cases.
for any I ∈ , then X is a closed itemset since it is a maximal frequent itemset. Furthermore, X is a sim-basic itemset by Proposition 2, i.e., X i ∈ .
…, then $X$ is the G-frequent itemset for $\Psi(X)$ by Theorem 2 and Corollary 1, i.e., … for the following reasons. Assume that $X \notin MFI_{i+1}$; then there exists a maximal frequent itemset for $\mathit{min\_sup} = i+1$, denoted by $X_0$, such that $X \subset X_0$. Since $X_0$ is also a frequent itemset for $\mathit{min\_sup} = i$, $X_0$ is a subset of a maximal frequent itemset for $\mathit{min\_sup} = i$. Therefore, $X$ is a proper subset of a maximal frequent itemset for $\mathit{min\_sup} = i$, which contradicts the fact that $X$ is a maximal frequent itemset for $\mathit{min\_sup} = i$.
Proof. Suppose there exists I 0 ∈ such that Then X must be a subset of a closed itemset whose support is i.
is the set of closed itemsets whose support is i according to Proposition 2, X is a subset of an element in i i ⋃ . Suppose X is a subset of an element in i i ⋃ . Then there exists an itemset X 0 such that X X i Ψ 0 | ( )| ∘ = and X X 0 ∘ is a closed itemset. Let I 0 be an element in X 0 . Assume that Proof. Step 1. For t i n ≤ ≤ , find the items from such that the support of each item is i. Denote the set of these items by i ( ) .
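Step 2. Divide the set of items of support $i$ obtained in Step 1 into disjoint subsets $X_1^{(i)}, X_2^{(i)}, \ldots$ such that for any two elements $I_1, I_2$ of that set, $\Psi(\{I_1\}) = \Psi(\{I_2\})$ if and only if $I_1, I_2$ are in the same subset.
Step 3. …
Step 4. Discovery of sim-basic itemsets.
Step 5. Computation of $MFI_t$.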

For t i n
then delete X from i (or i ) to obtain a new set denoted by i  (or i ͠ ). (1) Steps 1-3 are to find basic itemsets.

Output MFI
(2) According to Proposition 4, Step 4 is to find closed itemsets which meet the condition in Proposition 2, hence Step 4 can discover sim-basic itemsets under ∘ .
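Steps 1 and 2 of the procedure above group the items by support and then by identical transaction sets. A minimal Python sketch of just these two steps follows, assuming the database is given as a dictionary from transaction identifiers to item sets. The function name `group_items_by_tidset` and the toy data are hypothetical, and Steps 3 to 5 are not covered.

```python
from collections import defaultdict


def group_items_by_tidset(transactions, t):
    """For each support i >= t, partition the items of support i into
    classes of items that occur in exactly the same transactions."""
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)

    # classes[i][tidset] collects the items of support i sharing that tidset.
    classes = defaultdict(lambda: defaultdict(set))
    for item, tids in tidsets.items():
        if len(tids) >= t:
            classes[len(tids)][frozenset(tids)].add(item)

    return {
        i: sorted(sorted(c) for c in by_tids.values())
        for i, by_tids in classes.items()
    }


# Hypothetical toy database (not Table 1 of this article).
toy_db = {
    1: frozenset("ABD"),
    2: frozenset("ABC"),
    3: frozenset("ABC"),
    4: frozenset("CD"),
}
classes = group_items_by_tidset(toy_db, 3)
print(classes)
```

In the toy data, A and B occur in exactly the same three transactions and therefore fall into one class, while C has the same support but a different transaction set.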
Example 1. Suppose a database has six transactions and five items, as shown in Table 1. Let the minimum support threshold $t$ be 3. Then $MFI_3$ can be obtained by Algorithm 1.
Step 1. For i 3 6 ≤ ≤ , find the set of items whose support is i, and denote it by i ( ) . We have B C , Step 2. Divide 4 ( ) into disjoint subsets X B Similarly, X D E , Step 3. For Step 4. H X X , Step 5.
Step 6. From Steps 4 and 5, 3 = ∅ , X X X Step 7. Since Step 8. Output
Lemma 5. Suppose $A, B$ are two sorted sets in ascending order, with $|A| = p$ and $|B| = q$. Then the complexity of computing $A \cap B$ is $p + q$.

Proof. $A \cap B$ can be obtained by carrying out the following steps.
Step 1. Let $C = \emptyset$.
Step 2. If $A = \emptyset$ or $B = \emptyset$, then output $C$ and stop; otherwise, go to Step 3.
Step 3. Suppose $a_1$ and $b_1$ are the first elements of $A$ and $B$, respectively. If $a_1 < b_1$, then delete $a_1$ from $A$ to obtain a new set, again denoted by $A$, and return to Step 2. If $a_1 = b_1$, then add $a_1$ to $C$, delete $a_1$ and $b_1$ from $A$ and $B$, respectively, to obtain new sets, again denoted by $A$ and $B$, and return to Step 2. If $a_1 > b_1$, then delete $b_1$ from $B$ to obtain a new set, again denoted by $B$, and return to Step 2.
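Since at least one element is deleted from $A$ or $B$ in each execution of Step 3, Step 3 is executed at most $p + q$ times, so the complexity of computing $A \cap B$ is $p + q$. □
Step 2 makes comparisons between $\Psi(\{I_1\})$ and $\Psi(\{I_2\})$ for every two elements $I_1, I_2$ in $\mathcal{I}$. So the complexity is at most $2nC_m^{2}$ by Lemma 5.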
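The merge procedure in the proof of Lemma 5 can be sketched in Python as follows. Instead of deleting the first elements of $A$ and $B$, the sketch advances two indices, which has the same effect; the function name `intersect_sorted` is hypothetical.

```python
def intersect_sorted(a, b):
    """Intersection of two ascending lists without duplicates.

    Each loop iteration advances at least one index, so the loop runs
    at most len(a) + len(b) times.
    """
    i = j = 0
    c = []
    while i < len(a) and j < len(b):  # stop when either list is exhausted
        if a[i] < b[j]:
            i += 1                    # a[i] cannot occur in b
        elif a[i] == b[j]:
            c.append(a[i])            # common element
            i += 1
            j += 1
        else:
            j += 1                    # b[j] cannot occur in a
    return c


print(intersect_sorted([1, 3, 4, 6], [2, 3, 6, 7]))
```

Keeping transaction identifiers sorted is what allows each comparison of two transaction sets to be bounded by the sum of their sizes.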
Step 3 is to find X j i ͠ ( ) also by making comparisons between X Ψ j i ( ) ( ) and I Ψ({ }). And we do not need The complexity of computing , noting that the element number of ′ is less than 2 1 β − by Proposition 3 and A Ψ( ) is an intersection of some elements in The complexity of Step 4.5 is less than nC 2 ( ( )) + algorithm to generate all maximal frequent itemsets by graph theory, where l denotes the maximum of C C C C Ψ Ψ 1 | || ( )| (| | | ( )| ) / + − , with the maximum taken over all maximal frequent itemsets C. Since n t m , ≫ in practical applications,

Experiments
Besides the theoretical analysis, we use Python 3.5 to evaluate the performance of Algorithm 1 on four different datasets, Chess, Mushroom, T40I10D100K, and T10I4D100K, from UCI that have been commonly used in previous research. All experiments are performed on a computer with the Microsoft Windows 7 operating system, a Core i7 CPU, and 8 GB of RAM. Our algorithm is compared with three well-known deterministic algorithms for frequent itemset mining: ECLAT [20], Apriori [21], and FP-Growth [22].
Since we mainly considered complexities of algorithms in this article, we use the execution time as the performance measure. We examined each dataset with various minimum support thresholds and saved the execution time of each algorithm.
The results in Figure 1 indicate that Algorithm 1 performs better than the other approaches in terms of execution time.

Conclusion
We have presented methods to discover maximal frequent itemsets from the perspective of semigroup algebra and proved that the worst-case complexity of the methods is much lower than the known worst-case complexity. Experiments on four commonly used datasets also show that the algorithm based on our method performs better than three other well-known algorithms. Meanwhile, we provided explicit forms of simplified generators of frequent itemsets, proved that the simplified generators are maximal frequent itemsets and vice versa, provided a necessary and sufficient condition for a maximal $(i+1)$-frequent itemset being a subset of a closed $i$-frequent itemset, and put forward a recurrence formula of maximal frequent itemsets by defining basic itemsets and sim-basic itemsets.
We also explored some algebraic properties of rule mining, which can be used in further work to investigate other basic problems, such as more efficient algorithms for discovering closed frequent itemsets, generators of association rules, and the reduction of redundant association rules.
Acknowledgments: The authors sincerely thank the referees for their constructive comments and fruitful suggestions that helped improve this paper.
Author contributions: Jiang Liu and Feng Ni contributed to the overall design and the writing of this paper, put forward the main results, and gave the proofs; Jing Li, Xiang Xia, Shunlong Li, and Wenhui Dong contributed to the experiments.
Conflict of interest: The authors state no conflict of interest.