Xi'an Technological University
Subject: Computer Science , Software Engineering
eISSN: 2470-8038
SEARCH WITHIN CONTENT
Zhu Ying ^{*} / Wang Jianguo ^{*}
Keywords : Hadoop, Tree Structure, Load Balancing, Frequent item Sets, Association Rules
Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 3, Issue 4, Pages 7-16, DOI: https://doi.org/10.21307/ijanmc-2019-015
License : (CC-BY-NC-ND 4.0)
Published Online: 14-October-2019
Aiming at the problems of low efficiency, low cost of time and space, this paper proposes an incremental association rule mining algorithm based on Hadoop load balancing. In combination with the tree structure, when the data in the database is constantly updated, it does not need to scan the original database to generate frequent item sets, and use the load balancing in the data distribution so that the master node distributes the data to the child nodes evenly. In the experiment of control variable method, the variables of minimum support and sample increment are processed respectively. The experimental results show that when the minimum support is unchanged and the transaction data set is increased, the incremental association rule mining algorithm based on Hadoop load balancing takes less than 14.3% of the Apriori algorithm. The number of association rules mined by the algorithm is more than that of the Apriori algorithm. And the memory usage of the Hadoop-based incremental association rule mining algorithm is much smaller than the Apriori algorithm; when the total amount of transaction data is constant and the minimum support is changed, the memory usage of the Hadoop-based incremental association rule mining algorithm is smaller than the Apriori algorithm. The Hadoop-based incremental association rule mining algorithm has some improvements in memory usage and efficiency.
In today's era of big data, data is an important asset of information society, while data processing and management are also facing great challenges, such as data large amount of information, difficult to handle, difficult to obtain value. Data Mining [1] is a deep data intelligence analysis method. The method involves knowledge of many related fields such as machine learning, deep learning, artificial intelligence, and information retrieval. It is mainly used to extract potential information data from a massive data set that can contribute to the value of business decisions.Association rule mining technology [2] is a very important part of data mining technology. Under the big data environment, it can find many valuable information from the huge and irregular data by discovering the relationship among the items in the data set. Research on association rule algorithm in big datatechnology environment is a very important and challenging research topic. Considering that the data in the database is constantly updated, the incremental association rule updating techniques have been proposed to effectively maintain the association rules of updated data.
Literatures[3-4] proposed incremental association rule update technology, it is a highly efficient maintenance of data updating association rules, the literature describes in detail the characteristics of the technology. In the literature [5], the frequent item sets mining algorithm of association rules is proposed, which is an important component of association rule analysis, data statistics, causality and other mining tasks. The idea of parallel implementation of algorithm is proposed in the literature [6]. Data is distributed by the master node to each node to reduce the load pressure. In the literature [7], the FUP algorithm needs to scan the original database several times to solve the frequent item sets. Of the data environment, the mining efficiency is low.
In the literature [8-9], the paper introduces the characteristics and structure of the B+ tree, also mentioned the one-dimensional data index structure in the database.
When dealing with large amounts of data, processing efficiency can be compromised due to high server load.Therefore, this paper presents proposes an incremental association rule mining algorithm based on Hadoop load balancing. The implementation of this algorithm uses the Map/ Reduce programming model to parallelize the mining tasks, and uses load balancing to mine frequent item sets in the clusters to get the association rules.
When calculating the frequent item sets, the FUP algorithm [10] makes use of the known support of the old frequent item sets, and then through the frequent analysis of the candidate item sets in DB and db, many candidates can be filtered out in the process of determining frequent item sets. Therefore, it is more efficient than re-running the Apriori algorithm, especially if the old and new frequent sets are not much different. However, like the Apriori algorithm, the FUP algorithm must spend a lot of time processing a large set of candidate items and it is necessary to repeatedly scan the database DB and db to pattern match the candidate sets. Since the FUP algorithm uses L_{k−1} to generate C_{k}, the generated candidate set of db is large, and many of them are infrequent, and some even are impossible to appear in db, which affects the efficiency of the algorithm. In addition, the FUP algorithm needs to perform K+1 layer scanning on the entire database DBUdb. In general, the original database DB is very large, so the k+1 layer scanning for the DB is not efficient.
The shortcomings of the current incremental update association mining algorithm:
1) When the database is updated, the generation of frequent item sets will repeatedly scan the processed data sets, so that the mining efficiency is not high;
2) generating a large number of candidate sets;
3) With the addition of the database, the minimum support threshold and the minimum confidence threshold do not change, which will affect the mining of strong association rules;
Due to the inefficiency of FUP algorithm in dealing with big data mining, this paper proposes an improved algorithm of FUP, Incremental Association Rule Mining Algorithm Based on Hadoop Load Balancing. Combined with the Hadoop distributed platform and the use of a tree structure to store information, it is only necessary to scan the DB and dbone time, which can effectively improve the scanning efficiency. When the database is updated, the Newton interpolation formula is used to estimate the support threshold to efficiently process the association rules.
A traditional database can be represented by a two-dimensional matrix. Each column of a two-dimensional matrix can be treated as a project, and each row can be treated as a transaction of the database. In the traditional Apriori algorithm, horizontal data is used to represent a transaction. The typical method of implementing two-dimensional matrix is horizontal data. Most data mining algorithms use database access to calculate their support. In relational databases, the size of the data in the transaction database is relatively large, and there is a problem with the size of the computer memory at present, and there will be bottlenecks. This will lead to a contradiction: due to the continuous increase of data in the database. The size of the data will be larger and larger, but the memory and hard disk of the computer are limited. Therefore, when the data in the database reaches a certain scale, the computer memory and the hard disk will not be stored enough. Therefore, a new type of data representation is generated—vertical data representation. Vertical data means that each record in the database consists of a list of all the transaction records that it has appeared, as shown in Tab.2:
The B+ tree [12] is a tree data structure, which is an n-fork sort tree. It is a variant tree of B-trees required by the file system. It features:
1) There are n keywords in the node of n sub-trees. Each keyword does not save data, only used for indexing, and all data is stored in leaf nodes.
2) All leaf nodes contain information about all keywords, and pointers to records containing these keywords, and the leaf nodes themselves are linked in order from the smallest to the largest.
3) All non-terminal nodes can be thought of as index parts, which contain only the largest (or smallest) keyword in their sub-tree (root node).
As shown in Fig. 1, it is a B+ tree of m=3(consisting of information in Tab.2).
In this era of increasing data size, the traditional frequent item set mining algorithm has been difficult to support the mining of its massive transaction database. Even if it can be mined, it will lead to a particularly long mining time, or it may be due to the huge amount of data. Causes the program to fail. The solution to the problem of the traditional frequent item set mining algorithm is to add the Hadoop distributed platform to the frequent item set mining algorithm, which not only makes the traditional frequent item set mining algorithm run more efficiently, but also reduces the storage pressure of the database.
Based on the original algorithm, there are some disadvantages of generating a large number of candidate sets and repeated retrieval processing of the original database. Combined with the characteristics of B+ tree, the data information is stored. Based on FUP algorithm, the incremental association rule mining algorithm under Hadoop platform is proposed: HBPT-FUP algorithm(Incremental Association Rule Mining Algorithm Based on Hadoop Load Balancing, Improved Algorithm of FUP Algorithm).
1) Basic ideas
① Statistics in the database transaction items, get transaction items set. ② Setup process load estimation, through the Map process using load balancing grouping way to read the transaction items, distributed to different Reduce nodes. The Reduce process is based on how the B+ tree is created and stores transaction item information. The bottommost leaf node of B+ tree contains all the item sets. The frequent item sets are obtained by comparing the number of items with the minimum support, and other infrequent item sets remain in their leaf nodes to ensure that future data updating become frequent nodes. The local frequent item sets are obtained by the B+ tree created by the Reduce nodes, and the local frequent item sets are merged into the global frequent item sets to get the association rules.
Compared with the previous incremental updating correlation mining algorithm, the algorithm has the characteristics of load balancing. When dealing with large amounts of data, due to a single server load is too high affect the processing speed, so need to use server clusters to deal with. The load factor is added to the estimation of the load, which can calculate the load of each frequent item more accurately.
The algorithm takes advantage of the B+ balanced tree properties. The leaf nodes at the bottom of the tree contain all the keywords of the whole tree, and they are linked together like a doubly linked list.
2) Estimated load
The load in the parallelization process is equal to the sum of the load on each node [11]. supposed that the load corresponding to the data item is L_{i}, the position in the frequent item set is P_{i}, the load influence factor is a_{i} (the frequency of the data item):
3) Equalization of data
Frequent items in frequent item sets are divided into Q groups according to the balanced grouping strategy, which makes the data balance in the distribution process so as to improve the parallelization of the mining process.
If Q is less than the length of frequent itemsets, the Q items of frequent itemsets are assigned to each group in turn. The group's load is initialized by the sum of the frequent item loads in each group. And then repeat the following two steps until all the frequent items are assigned: ① To assign unallocated frequent items to the group with the lowest load; ② The load in each group is added up.
If Q is greater than the length of frequent itemsets, the first x of the frequent itemsets is assigned to the x group, which initializes the group load according to the load of the frequent items it contains, and then repeats the above two steps until all Frequent items are assigned.
4) Threshold setting
The minimum support threshold will change with the user's needs and the database update. When the threshold is set too low, the more rules are mined, but the rule is less useful. Conversely, when the threshold is set too high, there are very few rules for mining, which will lose some useful rules. Therefore, it is very important to set the appropriate threshold when dealing with incremental databases.
Definition 1: The support degree Support(X) of item set X is one of the most important metrics in association rule mining. It represents the probability that the item set {X, Y} will appear in the total item set. The formula is:
Where I is the total transaction sets.num() represents the number of occurrences of a particular item set in a transaction set.
Definition 2: Confidence indicates the probability that Y is derived by the association rule X → Y in the event that the prerequisite X occurs. That is, in the item set containing X, there is a possibility of Y. The formula is:
The Newton interpolation formula [13] that can be used to derive the support threshold from the above formula is:
Where C is the confidence level, n is the number of rules, and M is the total number of attributes in the transaction set.
When the association rule is first mined, the setting of the minimum support threshold is a trial and error process. When the transaction database is updated, it is possible that the previously set threshold is no longer applicable, so an adjustment threshold is required.
In order to improve the accuracy of the estimation, the order of the Newton interpolation polynomial can be improved as the number of executions of the algorithm increases. Because the algorithm generates a parameter pair (x_{k}, Sup_{k}) each time it runs. If this node is farther from the nearest two nearest interpolation nodes in the interpolation formula, you can add this node as a new interpolation node and adjust the interpolation formula. The accuracy of the adjustment of the adjusted interpolation formula at point is greatly improved.
Newton interpolation formula automatically determines the specific implementation process of support threshold, as follows:
Statistics are performed on some data in the experimental transaction data set (shopping data record), and the support degree and confidence corresponding to the rule RD → CD are calculated, as shown in Tab.3 (support, confidence in parentheses):
a) For the data in Tab.2, set the reliability threshold to 0.5, the support threshold to 0.3, and run frequent pattern mining. The association rules available according to Tab.3 are:
butter->ham,yogurt->ham,yogurt->bread,
cheese->ham,cheese->bread,
cheese->coffee.
b) Set the reliability threshold to 0.5 and the support threshold to 0.4. Run frequent pattern mining. The association rules available according to Tab.3 are:
butter->ham, yogurt->bread, cheese->ham.
c) Set the reliability threshold to 0.5 and the support threshold to 0.6. Run frequent pattern mining. The association rules available according to Tab.3 are:
yogurt->bread.
The Newton interpolation formula for the known support threshold is $\text{Sup}=\text{f}(\frac{\text{C*n}}{\text{M}})$. Then can get the following three expressions:
So you can get three interpolation nodes (x_{k}, Sup_{k}) as:(0.4375,0.3),(0.25,0.4),(0.0625, 0.6).
According to these three interpolation nodes, the difference quotient is shown in Tab.4:
From Tab.4, a second-order Newton interpolation polynomial can be written:
At this time, the support threshold can be estimated according to the Newton interpolation formula N_{2}(x) and the number of association rules expected by the user.
a) Determining the association rules that the user desires to mine (when the database update is about to be performed next time, the number of expected association rules is set to the average value of the total number of association rules of the previous mining);
b) Calculate the value of X according to the formula $\text{x}=\frac{C\text{*}n}{M};$
c) Substituting the value of X into N_{2}(x) to calculate the estimated value of support Sup.
Assuming that the user expects to mine three association rules, the calculated x value is:
Then the X value is substituted into N_{2}(x) to get the support: Sup = N_{2}(x) = 0.455. Sup can be chosen to be a value close to 0.455. For example, take sup=0.45 as the support threshold to perform the mining algorithm.
1) Brief description of the algorithm steps
Using the tree structure to store information, a Hadoop-based incremental association rule mining algorithm HBPT-FUP is proposed. The algorithm is described as follows:
Step 1: traversing the transaction database, processing according to the vertical data representation to get the item set. In the Setup phase: initialize the number of packets Q, distribute the transaction items of the primary node to the Q child nodes. A vertical data set is created in each node, and the vertical data set is traversed to obtain an item set inserted into the B+ tree.
Step 2: In each child node, all the frequent item sets of the group are obtained according to the generated B+ tree.
Step 3: When the transaction database updates the record, follow the first step to balance all incremental load to each node. The main focus is on the update algorithm of the B+ tree.
Add a keyword to the leaf node at the bottom of the B+ tree. If the number of keywords contained in the node does not exceed M, the insertion is successful. Otherwise, the node needs to be split. The number of keywords included in the two new nodes after splitting should not be less than (M/2 + 1). If the parent node of the node has enough space (M/2 ~ M-1), the parent node should contain the smallest keyword in the right node. Otherwise, the parent node is split and a new keyword is added to the parent of the parent node, followed by a higher order until the new root node is generated (1 ~ M-1).
Step 4: Association rules generated after database update.
2) Algorithm flowchart
In this experiment, the file retail (association rule study classic data set) downloaded from the UCI database was used as the experimental transaction data set, and the HBPT-FUP algorithm was compared with the Apriori algorithm. Use java language to simulate on Win7, dual-core 2.3GHZ CPU, 4GB memory PC.
1) Analysis of the time complexity of the algorithm
When the data is updated, ①only scans and updates part of the data in the database; ②scans the tree structure once and inserts the new item set. The analysis shows that when the minimum support is constant, the execution time of the algorithm is related to the amount of data updated each time. A small number of experimental samples are extracted from the dataset, and the minimum support degree of the Apriori algorithm is controlled (min_sup=0.45), and the amount of updated data is sequentially increased. The experimental time of the Apriori algorithm and the HBpT-Fup algorithm on a PC is recorded Figure 4 shows:
As shown in Figure 4, when the minimum support is constant, the execution time of the Apriori algorithm grows faster than the HBpT-Fup algorithm when the data set increases. When the data set is the same, the Apriori algorithm takes longer than the HBPT-FUP.
2) Comparison of the number of association rules mined.
When the database is updated, the minimum support threshold of the Apriori algorithm does not change (0.45), and the use of the Newton interpolation polynomial in the HBpT-Fup algorithm predicts the next minimum support threshold, which in turn makes the mined rules more efficient. In Experiment 1, the number of association rules mined each time is counted separately. The number comparison is shown in Figure 5:
As shown in Figure 5, when the amount of update data increases, the minimum support remains unchanged, and the number of association rules mined by the Apriori algorithm is less than that extracted by the HBpT-Fup algorithm. After adjusting the minimum support, the HBPT-FUP algorithm mines some rules that are easily missed, which increases the validity of the rules.
3) Analyze the spatial complexity of the algorithm
① In the B+ tree, only the item set and its frequency are stored, so the size of the occupied memory space is related to the total number of items; ② storing the frequent item sets determined by the minimum support, Therefore, the memory space is related to the minimum support.
a) Study the impact of the number of transaction item sets on memory usage. A small number of experimental samples are extracted from the data set, and the minimum support degree of the Apriori algorithm and the HBPT-FUP algorithm is controlled (min_sup=0.45), and the amount of updated data is sequentially increased. The Apriori algorithm and the memory usage of the HBPT-FuP algorithm on one of the PCs are recorded separately, as shown in Figure 6. (Set the minimum support threshold of the HBPT-FuP algorithm to remain the same)
As shown in Figure 6, when the minimum support is unchanged and the number of transaction items increases, the number of items generated increases, and the memory space occupied is larger. The Apriori algorithm updates the candidate set with the change of the number of transaction items. The memory space stores the candidate set and the frequent item set. The HBPT-FUP algorithm is based on distributed, and only stores data and its frequency in the tree structure of each node. Therefore, the Apriori algorithm takes up more memory space than the HBPT-FUP algorithm.
b) Study the impact of minimum support on memory usage. When the minimum support of the Apriori algorithm changes from 0.2 to 0.6, the data samples (10000) and increments (10000) remain unchanged, and the memory usage when running the Apriori algorithm and the HBPT-FUP algorithm on a PC is recorded separately. Figure 7 shows. (The minimum support threshold for the HBPT-FUP algorithm is automatically estimated based on the Newton interpolation polynomial when the database is updated)
As shown in Figure 7, the smaller the minimum support, the more the number of items generated, the larger the memory space occupied. In the case of the change of support degree, the Apriori algorithm updates the candidate set with the change of the support degree, and saves the candidate set and the frequent item set in the memory space. The HBPT-FuP algorithm does not generate the candidate set during the support change process, so The Apriori algorithm occupies more memory than the HBPT-FuP algorithm.
In this paper, a parallel update algorithm based on load balancing for incremental update association mining algorithm HBPT-FUP is proposed.The algorithm effectively realizes that when the database is updated, the original database is not scanned, and the newly added data is inserted into the tree based on the original B+ tree to obtain a frequent item set. Handle large-scale data with load balancing technology. Distribute data to different PCs in the cluster. The experimental results show that when the number of transaction records is the same as the amount of new data, the time spent by the HBPT-FUP algorithm is less than 14.3% of the Apriori algorithm, which improves the efficiency of mass data processing. When the minimum support changes, the memory space occupied by the HBPT-FuP algorithm on the same PC is much smaller than the Apriori algorithm. The algorithm has some improvements in terms of efficiency and memory usage.
First of all, I would like to thank my tutor during the master's degree, Professor Wang Jianguo. During my writing stage, Mr. Wang continued to give guidance and help, and taught me the direction of my own research. Mr. Wang has a wealth of professional domain knowledge and excellent data mining direction. Under the kind guidance of Mr. Wang, my ideas have been broadened, the bottlenecks in the research process have been solved, and my ideas have been implemented more steadily. At the same time, Mr. Wang is serious and responsible for his work and his mastery of professional knowledge. It is worth everyone to learn and respect.Secondly, I would like to thank Xi'an Technological University for guiding me to grow up. Finally, I am very grateful to the journal for including my thesis.