Association Rule-A Tool For Data Mining: Praveen Ranjan Srivastava

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 6

ASSOCIATION RULE- A TOOL FOR DATA MINING

By
Praveen Ranjan Srivastava
Lecturer (Computer Science)
Banasthali Vidyapith (Deemed University)
Banasthali-304002 (Rajasthan)
Phone : 01438-228787
E-mail : [email protected] / [email protected]
 

ABSTRACT:

Data mining is the process of extracting knowledge from large amount of databases
whether database is relational, temporal, spatial, multimedia etc. Data mining is very
popular technology some companies like IBM, ORACLE, and TCS are working on data
mining. Data mining engine having different kinds of approach like classification, clustering,
association, outliers analysis. Each approach has a different technology like classification
based on decision tree induction whether association based on interesting pattern
generation. Some people are saying data mining is the part of knowledge discovery in
databases (KDD). We can say Data mining tools perform data analysis and may uncover
important data patterns. This paper has a full attention on association rule. Association rule
is very important tool for mining process it has two special chartersticis first one is support
and another is confidence. Support gives total number of transaction of any particular item
are occurring in datasets while confidence gives strength of a data in a dataset, we can say
support is probability of A and B while confidence is conditional probability. Association rule
based on these two characteristics. Different algorithms support association rule but this
paper show only two very popular approach Apriori and FP-tree method. Both approach
gives frequent patterns, and candidate generation; frequent pattern means those item sets
that satisfy minimum support value. We can say in one-word association rule gives an
"interesting pattern"; this paper give and tries how association rules generate interesting
patterns form a huge amount of databases.

INTRODUCTION:

The major reason that data mining has attracted a great deal of attention in the information
industry in recent years is due to the wide availability of huge amount of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for application ranging from business
management, production control, and market analysis, to engineering design and science
exploration. Data mining can be viewed as a result of the natural evolution of information
technology. Data mining tools can answer business questions that traditionally were too
time consuming to resolve. They scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations. Data mining
used different types of engine such as classification, clustering, association, and
outliers .In classification class label is known whether in clustering class label is unknown,
association gives to interesting pattern of the data while outliers give fraud detections.

What is Data Mining?


Data mining tools perform data analysis and may uncover important data patterns,
contributing greatly to business strategies, knowledgebase's, and scientific and medical
research. The widening gap between data and information calls for a systematic
development of data mining tools. So simply you can say data mining refers to extracting
or "mining" knowledge from large amount of data. 

KDD (Knowledge Discovery in Databases): Many people treat data mining as a


synonym for another popularly used term, i.e. KDD. Alternatively, others view data mining
as simply an essential step in the process of knowledge discovery in databases. KDD
process is depicted in figure. 
 

1. Data Cleaning: To remove noise and inconsistent data.

2. Data integration: Where multiple data source may be combined.

3. Data Selection: Where data relevant to the analysis task are retrieved from the
database.

4. Data Transformation: Where data are transformed

5. Data mining: An essential process where intelligent methods are applied in order to
extract data pattern.

6. Pattern Evaluation: To identify the interesting pattern.

7. Knowledge presentation: Where visualization and knowledge representation


techniques are used.

Architecture of a data mining system: The Typical data mining architecture may have
following major component.
 
The data mining components are

 Database, data warehouse, or other information repository


 Database, data warehouse server
 Data mining engine
 Patterns evaluation model
 Graphical user interface

What is association analysis? Association analysis is the discovery of association rules


showing attribute-value conditions that occur frequently together in a given set of data.
Association analysis is widely used for market basket or transition data analysis.

Association rule having two main important properties.

 Support
 Confidence

The definition of the support and confidence is

                                        Support (AB) = P (AUB)

                                      Confidence (AB) = P (B|A).

If we correlate support and confidence then

                Confidence ( AB) = P (B|A) = Support_ count (AUB)/Support_ count (A)

Where Support_ count (AUB) is the number of transaction containing the item sets AU B,
and Support_ count (A) is the number of transactions containing the item set A.

"How are association rules mined from large databases?" Association rule mining is a two
–step process:

1. Find all frequent item sets: By definition, each of these item sets will occur at
least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent item sets: By
definition, these rules must satisfy minimum support and minimum support and
minimum confidence.

APRIORI ALGORITHM: (Finding Frequent Item sets Using Candidate Generation)

Apriori is an influential algorithm for mining frequent item sets. The name of the algorithms
is based on the fact that the algorithm uses prior knowledge of frequent item sets
properties. Apriori employs an iterative approach known as a level-wise search.

To improve the efficiency of the level-wise generation of frequent item sets, an important
property called the apriori property, i.e." all nonempty subsets of a frequent item sets
must also be frequent."

Apriori algorithms having a two-step process.

o The join step: To find  Lk , a set of candidate k item sets is generated by


joining Lk-1  with itself. This set of candidate is denoted Ck.
o The prune step:  Ck is the superset of Lk, that is, its members may or may
not be frequent, but all of the frequent k-itemsets are included in Ck. A  scan
of the databases to determine the count of each candidate in Ck would result
in the determination of Lk. (i.e. all candidates having a count no less than the
minimum support count are frequent by definition, and therefore belongs to
Lk) 

procedure AprioriAlg()
begin

L1 := {frequent 1-itemsets};
for ( k := 2; Lk-1 0; k++ ) do {
    Ck= apriori-gen(Lk-1) ; // new candidates
for all transactions t in the dataset do {
   for all candidates c Ck contained in t do
             c:count++
   }
   Lk = { c Ck | c:count >= min-support}
}
Answer := k Lk

end

It makes multiple passes over the database. In the first pass, the algorithm simply counts
item occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent
pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 (the set of all
frequent (k-1)-itemsets) found in the (k-1)th pass are used to generate the candidate
itemsets Ck, using the apriori-gen() function. This function first joins Lk-1 with Lk-1, the joining
condition being that the lexicographically ordered first k-2 items are the same. Next, it
deletes all those itemsets from the join result that have some (k-1)-subset that is not in Lk-1
yielding Ck.
The algorithm now scans the database. For each transaction, it determines which of the
candidates in Ck are contained in the transaction using a hash-tree data structure and
increments the count of those candidates. At the end of the pass, Ck is examined to
determine which of the candidates are frequent, yielding Lk . The algorithm terminates when
Lk becomes empty.

FP-TREE GROWTH ALGORITHM:

Apriori algorithms suffer from the following two shortcomings:

1. It is costly to handle large numbers of candidate sets. For instance, 104 frequent
1-itemsets, then approximately, 107   candidate 2-itemsets are generated.

2. It is tedious to repeatedly scan the database and check a large set of candidates
by pattern matching.

Keeping this in mind, a new class of algorithms has recently been proposed which avoids
the generation of large numbers of candidate sets. We describe one such method, called the
FP-tree growth algorithm. It is proposed by Han et al. The main idea of the algorithm is to
maintain a frequent pattern tree of the databases.

A frequent pattern tree (or FP-tree) is a tree structure consisting of an item-prefix-tree and
a frequent item-header table.

 Item- prefix-tree:

o It consists of a root node labeled null

o Each non-root node consists of three fields:

* Item name

* Support count

* Node link.

 Frequent-item –header-table: It consists of two fields;

* Item name

* Head of node link which points to the first node in the FP-tree

* Carrying the item name.

Association rules should not be used directly for prediction without further analysis or
domain knowledge. They do not   necessarily indicate causation. They are however a helpful
starting point for further exploration, making them a popular tool for understanding data.
CONCLUSION:

The discovery of association relationship among huge amount of data is useful in selective
marketing, decision analysis, and business management.  A popular area of application is
market basket analysis, which studies the buying habits of customers by searching for
sets of item that are frequently purchased together. Association rule mining consists of
first finding frequent item sets, from which strong association rules in the form of A B are
generated. Association rule is the important tool for data mining engine. It is very popular
technology now days. This paper have used only two approaches for association rule for
mining the process named apriori and fp-tree. These two algorithms are very popular for
mining rule. Last but not least we can say association rule gives interesting pattern to our
customers. 

References:

* http://www.kdnuggets.com/

* http://www.dmg.org/

* http://www.almaden.ibm.com/software/quest/

* http://www.data-mine.com/bin/site/templates/splash.asp

* http://www.thearling.com/

* www.indianmba.com

* Data mining a book by A.K.PUJARI (University of Hyderabad)

* Data mining book by Dunham

* Research papers by Rakesh Aggrawal and S Srikanth

You might also like