Paper
Andrej Trnka
Department of Applied Informatics
University of SS. Cyril and Methodius
Trnava, Slovak Republic
[email protected]
Abstract- This paper describes how Market Basket Analysis can be implemented in the Six Sigma methodology. Data Mining methods provide many opportunities in the market sector, and Market Basket Analysis is one of them. Six Sigma methodology uses several statistical methods. By implementing Market Basket Analysis (as a part of Data Mining) in one of the Six Sigma phases, we can improve the results and change the Sigma performance level of the process. In our research we used the GRI (Generalized Rule Induction) algorithm to produce association rules between products in the market basket. These associations show a variety between the products. To show the dependence between the products we used a Web plot. The last algorithm in the analysis was C5.0, which was used to build rule-based profiles.

Keywords-Data Mining; Six Sigma; Market Basket; CRISP-DM

I. INTRODUCTION
In our research we tried to implement a few Data Mining methods in the Six Sigma methodology and to improve results with them. Reference [7] describes the possible use of Data Mining methods in industry in general. However, Six Sigma is specific in its approach to the capability of the process. Six Sigma methodology is used to improve business processes.
We can use the DMAIC cycle for an existing process or the DMADV cycle for a new process. The letters in the acronyms stand for the activities in Six Sigma methodology.
DMAIC - Define, Measure, Analyze, Improve, Control
DMADV - Define, Measure, Analyze, Design, Verify
In our research we improve an existing process, so we focus on the DMAIC cycle of Six Sigma methodology.
The Define phase is where we start each project. It's helpful to think of the outputs of each stage to know what the stage does. In this stage we establish the goal of the project. We should know who our customer is, which process we will be working on, which part of the process we will work on, which of the process variables are important, what the goal is for those process variables, how much money we expect to save or other benefits we will get, when the project will be finished, who the project team is, and who the stakeholders are. It is in the Define phase where we first see evidence of Six Sigma thinking. The goal of a project always has to be stated as a measurement, the value of which will be improved. It might be stated as several measurements and goals for each of them. Each of these measurements is an important process output that matters to the customer. All Six Sigma projects must identify who the customer is for each process. The customer sets the acceptable parameters for each of the key process outputs. These are usually expressed as specifications, but might take other forms. The key is that no project can proceed unless a quantifiable goal is stated. At the end of the Define stage you will have a well-defined project with well-defined and quantifiable goals.
The next stage is the Measure phase. At the output of this phase we should have a thorough understanding of how the process is behaving right now. By understanding we mean a quantitative description. This description will include current process variable averages, standard deviations, behavior over time, and histograms. In addition, we will know whether the process is stable or not. Another thing we will know at the end of this stage is whether or not the process has a capability and what the value is. The Sigma performance level, for which Six Sigma is named, is an example of a capability measure. To do all of this we have to collect data, and to know what data to collect, we need to know which characteristics are critical. Sometimes this is determined in the Define phase, but sometimes what we are given in the Define phase is not sufficient for us to collect data. The first thing we would then do is determine the precise measurements necessary to determine process capability.
The third phase in a DMAIC cycle is the Analyze phase. The overall approach in Six Sigma problem solving is to carefully define a problem, find the root causes for the problem, and then attack the root causes. The purpose of the Analyze phase is to correctly identify what those root causes are and prove it with data. In this disciplined problem solving method, opinions are not worth much. We might have historical data that we can analyze using some basic statistics or we might have to conduct some experiments and collect some new data.
In the Improve phase, we come up with solutions. This phase also often starts with a brainstorming session. But now we know what the root causes are. A root cause is something that is controllable and has a direct effect on the characteristic we are trying to improve. This problem-solving session again can be short and simple or long and complex, depending on the process. At the end of this phase, the process improvement is installed and the process is now running per goals. Some Six Sigma programs split this phase into two,
2010 International Conference on Networking and Information Technology
Data Preparation prepares the data. Once the data resources available are identified, they need to be selected, cleaned, built into the form desired, and formatted. Data cleaning and data transformation in preparation of data modeling needs to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding.
Modeling. Data Mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Once greater data understanding is gained (often through pattern recognition triggered by viewing model output), more detailed models appropriate to the data type can be applied. The division of data into training and test sets is also needed for modeling.
Evaluation of the results. Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). This will lead to the identification of other needs (often through pattern recognition), frequently reverting to prior phases of CRISP-DM. Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations.
Deployment of the model. Data mining can be used either to verify previously held hypotheses or for knowledge discovery (identification of unexpected and useful relationships). Through the knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be obtained that may then be applied to business operations for many purposes, including prediction or identification of key situations. These models need to be monitored for changes in operating conditions, because what might be true today may not be true a year from now. If significant changes do occur, the model should be redone. It's also wise to record the results of data mining projects so documented evidence is available for future studies. [5]
Fig. 2 shows the CRISP-DM process.
The use of Data Mining methods in Six Sigma methodology requires joining the CRISP-DM process to the DMAIC phases.

II. MARKET BASKET ANALYSIS
Market basket analysis has the objective of identifying products, or groups of products, which tend to occur together (are associated) in buying transactions (baskets). The knowledge obtained from a market basket analysis can be very valuable; for instance, it can be employed by a supermarket to reorganize its layout, taking products frequently sold together and locating them in close proximity. But it can also be used to improve the efficiency of a promotional campaign: products that are associated should not be put on promotion at the same time. By promoting just one of the associated products, it should be possible to increase the sales of that product and get accompanying sales increases for the associated products.
The databases usually considered in a market basket analysis consist of all the transactions made in a certain sale period (e.g. one year) and in certain sale locations (e.g. a chain of supermarkets). Consumers can appear more than once in the database. In fact, consumers will appear in the database whenever they carry out a transaction at a sales location. The objective of the analysis is to find the most frequent combinations of products bought by the customers. The association rules in Section 4.8 represent the most natural methodology here; indeed they were actually developed for this purpose. Analyzing the combinations of products bought by the customers, and the number of times these combinations are repeated, leads to a rule of the type 'if condition, then result' with a corresponding interestingness measurement. Each rule of this type describes a particular local pattern. The set of association rules can be easily interpreted and communicated. Possible disadvantages are locality and lack of probability modeling.
In the shop, the recorded transactions are all the transactions made by someone holding one of the chain's loyalty cards. Each card carries a code that identifies features about the owner, including important personal characteristics such as sex, date of birth, partner's date of birth, number of children, profession and education. The card allows the analyst to follow the buying behavior of its owner: how many times they go to the supermarket in a given period, what they buy, whether they follow the promotions, etc. [1]
Similar analyses can be found in [1].
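The 'if condition, then result' rules described above can be illustrated with a minimal Python sketch that computes two common interestingness measures, support and confidence, for a single rule. The baskets, the product names and the function name `rule_metrics` are hypothetical and are not data or code from this study. The sketch assumes that support means the share of baskets containing the antecedent; some tools instead report the share of baskets containing antecedent and consequent together.

```python
from typing import FrozenSet, Iterable, List, Tuple


def rule_metrics(
    baskets: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: str,
) -> Tuple[float, float]:
    """Return (support %, confidence %) for the rule antecedent -> consequent.

    Assumption: support is the share of baskets containing the antecedent.
    """
    antecedent_set = frozenset(antecedent)
    # Baskets that contain every antecedent product.
    matching = [b for b in baskets if antecedent_set <= b]
    if not baskets or not matching:
        return 0.0, 0.0
    # Of those baskets, the ones that also contain the consequent.
    hits = sum(1 for b in matching if consequent in b)
    support = 100.0 * len(matching) / len(baskets)
    confidence = 100.0 * hits / len(matching)
    return support, confidence


# Hypothetical toy baskets (not data from the study).
toy_baskets = [
    frozenset({"product D", "product F", "product G"}),
    frozenset({"product D", "product F", "product G"}),
    frozenset({"product D", "product F"}),
    frozenset({"product A", "product J"}),
    frozenset({"product B"}),
]

support, confidence = rule_metrics(
    toy_baskets, ["product D", "product F"], "product G"
)
print(f"support = {support:.2f} %, confidence = {confidence:.2f} %")
```

In this toy data three of five baskets contain the antecedent, and two of those three also contain the consequent, so the rule has 60 % support and about 66.67 % confidence.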
pictures of association between products in the basket. We [...] Table II shows the generated association rules between products. Data in the table are sorted by confidence.

TABLE II. ASSOCIATION RULES (A PART)

Consequent   Antecedent                         Support %   Confidence %
product G    product B, product D, product F    3,47        95,45
product D    product B, product F, product G    3,47        95,45
product D    product A, product F, product G    4,42        92,86
product F    product C, product D, product G    2,21        92,86
product D    product E, product F, product G    3,79        91,67
product F    product B, product D, product G    3,63        91,3
product F    product D, product G               16,56       88,57
product G    product A, product D, product F    4,73        86,67
product F    product A, product D, product G    4,73        86,67
product G    product C, product D, product F    2,37        86,67
product G    product D, product F               17,03       86,11
product D    product F, product G               17,03       86,11
product D    product E, product G               5,68        72,22

These association rules (two-way) show a variety between product G, product D and product F.
The next step was a graphical view of the associated products. We used a Web plot to show the dependence between the products. The results are shown in Fig. 3.

Figure 3. Web plot of associations (nodes: product A to product J)

Bold lines in this plot show the groups of customers suggested by the GRI model. In the resulting display, two groups of customers stand out:
• those who buy product D, product F and product G (group A),
• those who buy product A and product J (group B).
In the third step we profiled the customer groups. We needed to know who these customers are. This could be achieved by tagging each customer with a flag for each of these groups. To build rule-based profiles of these flags we used rule induction (C5.0). Consequently we received the following rules:

Generated rule for group A:
Rule 1 for T (true)
    if income <= 16 900
    and sex = M
    then T

Generated rules for group B:
Rule 1 for T
    if age <= 19
    and income > 26 800
    then T
Rule 2 for T
    if age > 22
    and age <= 24
    and income <= 20 900
    and payment_method = CASH
    then T
Rule 3 for T
    if age > 16
    and age <= 22
    and income > 12 200
    and income <= 15 500
    and payment_method = CARD
    and sex = F
    then T
Rule 4 for T
    if age <= 24
    and income > 26 800
    and payment_method = CHEQUE
    then T
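The rules generated for group A and group B in this section can be read as plain predicates over customer attributes. The following Python sketch restates them under stated assumptions: the `Customer` record and its field names (`age`, `income`, `sex`, `payment_method`) are hypothetical, and only the thresholds are copied from the listed rules. The functions are a restatement of those rules, not the C5.0 model itself.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical record layout; the field names are assumptions.
    age: int
    income: float
    sex: str             # "M" or "F"
    payment_method: str  # "CASH", "CARD" or "CHEQUE"


def in_group_a(c: Customer) -> bool:
    # Rule 1 for group A.
    return c.income <= 16900 and c.sex == "M"


def in_group_b(c: Customer) -> bool:
    # Rules 1 to 4 for group B; a customer is flagged if any rule fires.
    rule1 = c.age <= 19 and c.income > 26800
    rule2 = (22 < c.age <= 24 and c.income <= 20900
             and c.payment_method == "CASH")
    rule3 = (16 < c.age <= 22 and 12200 < c.income <= 15500
             and c.payment_method == "CARD" and c.sex == "F")
    rule4 = (c.age <= 24 and c.income > 26800
             and c.payment_method == "CHEQUE")
    return rule1 or rule2 or rule3 or rule4


# Hypothetical customer used only to exercise the predicates.
example = Customer(age=23, income=18000, sex="F", payment_method="CASH")
print(in_group_a(example), in_group_b(example))
```

Writing each rule as a named boolean keeps the one-to-one correspondence with the rule listing, which makes it easy to check a threshold against the source when the model is rebuilt.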