Predicting Missing Items in Shopping Carts Using Fast Algorithm
Predicting Missing Items in Shopping Carts Using Fast Algorithm
Predicting Missing Items in Shopping Carts Using Fast Algorithm
5, May 2011
Sunil Kumar. M
Sri Venkateswara College of Engineering, Chennai- 602105
Vijayshankar. V
Sri Venkateswara College of Engineering, Chennai- 602105
Leela Rani P
Assistant Professor Sri Venkateswara College of Engineering, Chennai- 602105
ABSTRACT
Prediction in shopping cart uses partial information about the contents of a shopping cart for the prediction of what else the customer is likely to buy. In order to reduce the rule mining cost, a fast algorithm generating frequent itemsets without generating candidate itemsets is proposed. The algorithm uses Boolean vector with relational AND operation to discover frequent itemsets and generate the association rule. Association rules are used to identify relationships among a set of items in database. Initially Boolean Matrix is generated by transforming the database into Boolean values. The frequent itemsets are generated from the Boolean matrix. Then association rules are to generated from the already generated frequent itemsets. The association rules generated form the basis for prediction. The incoming itemset i.e the content of incoming shopping cart will also be represented by a Boolean vector and AND operation is performed with each transaction vector to generate the association rules. Finally the rules are combined to get the predictions. Dempsters rule of combination (DRC) is used to combine the evidences. Finally the predicted items are suggested to the user.
hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to many questions. The information and knowledge gained can be used for application ranging from market analysis, fraud detection, and customer retention, to production control and science exploration. Data Mining plays an important role in online shopping for analyzing the subscribers data and understanding their behaviors and making good decisions such that customer acquisition and customer retention are increased which gives high revenue.
Keywords
Association Rule Mining. Boolean Vector, Prediction, Basic Belief Assignment, Demster Shafer Theory of Rule Combination.
1.3 Prediction
Data mining automates the process of finding predictive information in large databases. Questions that traditionally
35
International Journal of Computer Applications (0975 8887) Volume 21 No.5, May 2011 required extensive hands-on analysis can now be answered directly from the data quickly. The primary task of association mining is to detect frequently co-occurring groups of items in transactional databases. The intention is to use this knowledge for prediction purposes. Early attempts for prediction used classification [8] and performance was favourable. In this project, any item is allowed to be treated as a class label its value is to be predicted based on the presence of other items. Put another way, knowing a subset of the shopping carts contents, we want to guess (predict) [1] the rest. Suppose the shopping cart of a customer at the checkout counter contains bread, butter, milk, cheese, and pudding. Could someone who met the same customer when the cart contained only bread, butter, and milk, have predicted that the person would add cheese and pudding? It is important to understand that allowing any item to be treated as a class label presents serious challenges as compared with the case of just a single class label. The number of different items can be very high, perhaps hundreds, or thousand, or even more. To generate association rules for each of them separately would give rise to great many rules with two obvious consequences: first, the memory space occupied by these rules can be many times larger than the original database (because of the tasks combinatorial nature); second, identifying the most relevant rules and combining their sometimes conflicting predictions may easily incur prohibitive computational costs. In this work, both of these problems are solved by developing a technique that answers users queries (for shopping cart completion) in a way that is acceptable not only in terms of accuracy, but also in terms of time and space complexity. This paradigm can be exploited in diverse applications. For example, in the each shopping cart contained a set of hyperlinks pointing to a Web page; in medical applications, the shopping cart may contain a patients symptoms, results of lab tests, and diagnoses; in a financial domain, the cart may contain companies held in the same portfolio. In all these databases, prediction of unknown items can play a very important role. For instance, a patients symptoms are rarely due to a single cause; two or more diseases usually conspire to make the person sick. Having identified one, the physician tends to focus on how to treat this single disorder, ignoring others that can meanwhile deteriorate the patients condition. Such unintentional neglect can be prevented by subjecting the patient to all possible lab tests. However, the number of tests one can undergo is limited by such practical factors as time, costs, and the patients discomfort. A decisionsupport system advising a medical doctor about which other diseases may accompany the ones already diagnosed can help in the selection of the most relevant additional tests. upper-bounded by twice the number of transactions in the original database. Note that some of the itemsets in IT-tree [4] are identical to at least one of the transactions contained in the original database, whereas others were created during the process of tree building where they came into being as common ancestors of transactions from lower levels. They modified the original tree building algorithm by flagging each node that is identical to at least one transaction. These are indicated by black dots. This is called flagged IT-tree [4].
Here is an example for an IT-tree [4]. The flagged IT-tree of the database D = { [1, 4] , [2, 5] , [1, 2, 3, 4, 5] , [1, 2, 4] , [2, 5] , [2, 4] is
f=3 1 f=2 1, 4 1, 2 2
f=3
f=1
f=2
f=1
1,2,3,4,5
1,2,4
f=1
Fig.1.1. Example IT-Tree Here some items in this tree are flagged to represent them as weak entity so that they are not carried for the next stage of processing. The disadvantages of existing approach are Time taken to construct IT-Tree [4] is more when compared to Boolean matrix method. This method requires more memory for processing.
2. SYSTEM DESIGN
Any project developed today is said to be good only if it has some basic characteristics such as modularity, loose coupling and high cohesion. A component is classified as good quality only if it is modular, loosely coupled and has high cohesion i.e., each component should be independent of the other and each
36
International Journal of Computer Applications (0975 8887) Volume 21 No.5, May 2011 component must be focused only on its particular purpose. Finally the component should be modular so that the development of the components is understandable, can easily be enhanced in the future and also easy to locate and correct errors without affecting the other components involved in the project. The following sections deals with how this system is designed, the modules involved and overall architecture diagram of the system which shows the modules present in the system. However, many rules with equal antecedents differ in their consequentssome of these consequents contain the desired item to be predicted, others do not. The question is how to combine (and how to quantify) the potentially conflicting evidences. DRC [6] is used for this purpose. Finally the predicted items are suggested to the user.
3. IMPLEMENTATION
This topic consists of detailed description of each and every module with its advantages and data and execution flow of each module with algorithm. It helps to understand each and every module of the project more deeply and clearly. Each description consists of the basic concept of the module, input and also the excepted output.
Frequent Itemset Generation Incoming Instance Association Rule Generation BBA and Rule Selection
3.1 Modules
The project has been divided into various modules and each module has been completed within a scheduled time line. The following are the modules of the project are boolean matrix generation,Frequent itemset generation, Association rule generation, BBA and decision making.
Prediction
Fig.2.1. Shopping cart prediction architecture Fig.4.1. shows the shopping cart prediction architecture in which the Boolean Matrix is generated by transforming the database into Boolean values. The frequent itemsets are generated from the Boolean matrix. At this stage we need the Support value. Then association rules are to generated from the already generated frequent itemsets. It takes minimum confidence from the user and discovers all rules with a fixed antecedent and with different consequent. The association rules generated form the basis for prediction. We assign BBA [2] value to each association rule generated. This gives more weight to rules with higher support masses are assigned based on both their confidence and support values. The incoming itemset i.e the content of incoming shopping cart will also be represented by a Boolean vector and AND operation is performed with each transaction vector to generate the association rules. Finally the rules are combined to get the predictions. Dempsters rule of combination (DRC) [6] is used to combine the evidences. When searching for a way to predict the presence or absence of an item in a partially observed shopping cart s, we wanted to use association rules.
37
International Journal of Computer Applications (0975 8887) Volume 21 No.5, May 2011
Table ID Field1 10 N 11 N 12 Y 13 N 14 N 15 N Field2 N N Y N N N Field3 Y N N Y Y Y Field4 Y Y N Y Y Y Field5 Y Y N Y Y Y Field6 y y y y y y Field7 N N Y Y Y N
fk={ ii1, ii2,,iik }; } for each item ii in fk if | fk(ii)| < k delete the column ci according to item ii from bdb; for each row rj from bdb if sum(rj) < k+1 delete rj from bdb; k=k+1 } return f= f1 u f2 u fk The input given to this stage is the Boolean matrix generated in previous module. This is the sample input that is given for illustration. The support value is also given as input.
1 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0
Table.3.1 shows that If an item is contained in a transaction, the corresponding attribute value will be Y, otherwise the value will be N.The output of this module will be the Boolean matrix, which will look as this.
1 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0
Support Value: 50
The output of this stage is the frequent itemsets generated from Boolean matrix. Each frequent itemset generated is also converted into a Boolean vector as shown in the table below. Input configuration: 7 items, 8 transactions, minsup = 50.0% Frequent 1-itemsets [3, 4, 5, 6, 7] Frequent 2-itemsets [3 4, 3 5, 3 6, 4 5, 4 6, 5 6, 5 7, 6 7] Frequent 3-itemsets [3 4 5, 3 4 6, 3 5 6, 4 5 6, 5 6 7] Frequent 4-itemsets [3 4 5 6] Table.3.2. Frequent itemsets generation output
--------------------------------------------------------------------------------------Itemset Boolean Vector Support --------------------------------------------------------------------------------------3 [null, false, false, true, false, false, false, false] 0.5 4 [null, false, false, false, true, false, false, false] 0.625 5 [null, false, false, false, false, true, false, false] 0.875 6 [null, false, false, false, false, false, true, false] 1.0 7 [null, false, false, false, false, false, false, true] 0.625 34 [null, false, false, true, true, false, false, false] 0.5 35 [null, false, false, true, false, true, false, false] 0.5 36 [null, false, false, true, false, false, true, false] 0.5 45 [null, false, false, false, true, true, false, false] 0.625 46 [null, false, false, false, true, false, true, false] 0.625 56 [null, false, false, false, false, true, true, false] 0.875 57 [null, false, false, false, false, true, false, true] 0.5 67 [null, false, false, false, false, false, true, true] 0.625 345 [null, false, false, true, true, true, false, false] 0.5 346 [null, false, false, true, true, false, true, false] 0.5 356 [null, false, false, true, false, true, true, false] 0.5 456 [null, false, false, false, true, true, true, false] 0.625 567 [null, false, false, false, false, true, true, true] 0.5 -------------------------------------------------------------------------------------
If an item is contained in a transaction, the corresponding attribute value will be 1; otherwise the value will be 0.
38
International Journal of Computer Applications (0975 8887) Volume 21 No.5, May 2011 Table.3.2. shows the frequent itemsets along with the Boolean Vector generated from Boolean Matrix. User Support value is needed for this. If an item is present in the frequent itemset, it takes the value True in the Boolean Vector, otherwise the value will be False. It involves Join step and Prune step. The frequent itemset generated should have desired support value. else found=0 endif endif end do end do The input given to the rule generation process is the Boolean vectors representing each transaction and also the contents of incomplete shopping cart. This is shown in Table 5.3. Table.3.3. Rule Generation Input
--------------------------------------------------------------------------------------Itemset Boolean Vector Support --------------------------------------------------------------------------------------3 [null, false, false, true, false, false, false, false] 0.5 4 [null, false, false, false, true, false, false, false] 0.625 5 [null, false, false, false, false, true, false, false] 0.875 6 [null, false, false, false, false, false, true, false] 1.0 7 [null, false, false, false, false, false, false, true] 0.625 34 [null, false, false, true, true, false, false, false] 0.5 35 [null, false, false, true, false, true, false, false] 0.5 36 [null, false, false, true, false, false, true, false] 0.5 45 [null, false, false, false, true, true, false, false] 0.625 46 [null, false, false, false, true, false, true, false] 0.625 56 [null, false, false, false, false, true, true, false] 0.875 57 [null, false, false, false, false, true, false, true] 0.5 67 [null, false, false, false, false, false, true, true] 0.625 345 [null, false, false, true, true, true, false, false] 0.5 346 [null, false, false, true, true, false, true, false] 0.5 356 [null, false, false, true, false, true, true, false] 0.5 456 [null, false, false, false, true, true, true, false] 0.625 567 [null, false, false, false, false, true, true, true] 0.5 -------------------------------------------------------------------------------------
Enter confidence: 80 Enter incoming shopping cart contents: 3 4 Table.3.4. Rule Generation Output
---------------------------------Rule Supp conf ---------------------------------3 4 ->5 0.5 1.0 3 4 ->6 0.5 1.0 ---------------------------------
Table.3.4. shows the output rules generated along with the support and confidence values.
39
International Journal of Computer Applications (0975 8887) Volume 21 No.5, May 2011 The partitioned-support p_supp of the rule, r (a) -> r(c), is the percentage of transactions that contain r (a) among those transactions that contain r(c), i.e., p_supp = support(r (a) U r(c)) / support(r(c))
3.6 Comparison
The performance of both the existing tree approach and the proposed approach is analyzed with databases of different sizes. The results found are very much surprising in the proposed approach compared to the tree approach. The time of prediction has decreased to a great extent compared to existing tree approach. Table.3.7. Execution Time Comparison Number of Transactions 100 80 60 Tree Approach Execution Time 48.7 44.172 42.031 40.015 38.721 Proposed Approach Execution Time 0.797 0.625 0.578 0.593 0.547
3.5.2 BBA:
In association mining techniques, a user-set minimum support decides about which rules have high support. Once the rules are selected, they are all treated the same, irrespective of how high or how low their support. Decisions are then made solely based on the confidence value of the rule. However, a more intuitive approach would give more weight to rules with higher support. Therefore, we use a novel method to assign to the rules masses based on both their confidence and support values. This weight value is called Basic Belief Assignment (BBA) [2]. We assign BBA value to each association rule generated. = ((1+ ) x conf x p_supp ) / ( x conf + p_supp); [0,1]; Dempsters rule of combination (DRC) [6] is used to combine the evidences. When searching for a way to predict the presence or absence of an item in a partially observed shopping carts, we wanted to use association rules. However, many rules with equal antecedents differ in their consequentssome of these consequents contain the desired item to be predicted, others do not. The question is how to combine (and how to quantify) the potentially conflicting evidences. DRC [6] is used for this purpose. Some illustrations used from DRC [6] are explained in following paragraph. We remove the overlapping rules while keeping the highest confidence rule. If two overlapping rules have the same confidence, the rule with the lower support is dropped. Finally the best rule is selected by comparing the mass values. The predicted item is then suggested to the user. The input given to this stage is the set of rules generated along with their support and confidence values as shown in Table 5.5. Table.3.5. BBA and Decision Making Input
---------------------------------Rule Supp conf ---------------------------------3 4 ->5 0.5 1.0 3 4 ->6 0.5 1.0 --------------------------------2 2
40 20
Table.3.7. shows the comparison of execution time between the existing tree approach and proposed approach for different transactions.
4. CONCLUSION
The fast algorithm generating frequent itemsets without generating candidate itemsets proposed performs well compared to existing approaches. The execution time is much improved as shown in performance testing. The algorithm used Boolean vector with relational AND operation to discover frequent itemsets and generate the association rule. Association rules formed the basis of prediction. The algorithm is applied in a demo shopping cart application. When user adds each item to the cart the algorithm is executed and the prediction is displayed.
Predicted Result:
Table 3.6 shows the output of this stage i.e, the calculated BBA [6] values. The prediction made is also shown below the table.
40
International Journal of Computer Applications (0975 8887) Volume 21 No.5, May 2011 The Advantages of proposed work are The proposed algorithm doesnt generate the candidate itemsets. It uses only a single pass over the database. The memory consumed is also very less. Processing speed is more when compared to rules generated using item set tree and DS theory. Symp. Computers and Comm. (ISCC 01), pp. 107-113, 2001. [4]. R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proc. Intl Conf.Very Large Databases(VLDB 94), pp.487-499, 1994. [5]. K.K.R.G.K. Hewawasam, K. Premaratne, and M.-L. Shyu, Rule Mining and Classification in a Situation Assessment Application: A Belief Theoretic Approach for Handling Data Imperfections, IEEE Trans. Systems, Man, Cybernetics, B, vol. 37, no. 6 pp. 1446-1459, Dec. 2007. [6] Apriori Algorithm Reference URL: http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/items et_prog1.html
5. FUTURE ENHANCEMENT
The method proposed in this project was tested with a demo shopping cart and the performance is found acceptable. But further improvements can be made to reduce the cost of rule generation process. Also new data structures can be proposed that will be more suitable to handle large number of itemsets as in shopping cart.
6. REFERENCES
[1]. Kasun Wickramaratna, Miroslav Kubat and Kamal Premaratne, Predicting Missing Items in Shopping Carts, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 7, july 2009. [2]. M.Anandhavalli, Sandip Jain, Abhirup Chakraborti, Nayanjyoti Roy and M.K.Ghose Mining Association Rules Using Fast Algorithm, Advance Computing Conference (IACC), 2010 IEEE 2nd International. [3]. H.H. Aly, A.A. Amr, and Y. Taha, Fast Mining of Association Rules in Large-Scale Problems, Proc. IEEE
[7] P. Bollmann-Sdorra, A. Hafez, and V.V. Raghavan, A Theoretical Framework for Association Mining Based on the Boolean Retrieval Model, Data Warehousing and Knowledge Discovery: Proc. Third Intl Conf. (DaWaK 01), pp. 21-30, Sept. 2001. [8] W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, Proc. IEEE Intl Conf. Data Mining (ICDM 01), pp. 369376, Nov./Dec. 2001. [9] M. Kubat, A. Hafez, V.V. Raghavan, J.R. Lekkala, and W.K. Chen, Itemset Trees for Targeted Association Querying, IEEE Trans. Knowledge and Data Eng., vol. 15, no. 6, pp. 1522-1534, Nov./Dec.2003.
41