2010 International Conference on Networking and Information Technology

Market Basket Analysis with Data Mining Methods


Six Sigma methodology improvement

Andrej Trnka
Department of Applied Informatics
University of SS. Cyril and Methodius
Trnava, Slovak Republic
[email protected]

978-1-4244-7578-0/$26.00 © 2010 IEEE

Abstract- This paper describes how Market Basket Analysis can be implemented in the Six Sigma methodology. Data Mining methods provide many opportunities in the market sector, and Market Basket Analysis is one of them. The Six Sigma methodology uses several statistical methods. By implementing Market Basket Analysis (as a part of Data Mining) in one of the Six Sigma phases, we can improve the results and change the Sigma performance level of the process. In our research we used the GRI (Generalized Rule Induction) algorithm to produce association rules between products in the market basket. These associations show how the products are related. To show the dependence between the products we used a Web plot. The last algorithm in the analysis was C5.0, which was used to build rule-based profiles.

Keywords-Data Mining; Six Sigma; Market Basket; CRISP-DM

I. INTRODUCTION

In our research we tried to apply several Data Mining methods within the Six Sigma methodology and to improve its results with them. Reference [7] describes the possible uses of Data Mining methods in industry in general. Six Sigma, however, is specific in its approach to the capability of the process. The Six Sigma methodology is used to improve business processes.

We can use the DMAIC cycle for an existing process or the DMADV cycle for a new process. The letters in each acronym stand for activities in the Six Sigma methodology:
DMAIC - Define, Measure, Analyze, Improve, Control
DMADV - Define, Measure, Analyze, Design, Verify
In our research we improve an existing process, so we focus on the DMAIC cycle of the Six Sigma methodology.

The Define phase is where we start each project. It's helpful to think of the outputs of each stage to know what the stage does. In this stage we establish the goal of the project. We should know who our customer is, which process we will be working on, which part of the process we will work on, which of the process variables are important, what the goal is for those process variables, how much money we expect to save or other benefits we will get, when the project will be finished, who the project team is, and who the stakeholders are. It is in the Define phase where we first see evidence of Six Sigma thinking. The goal of a project always has to be stated as a measurement, the value of which will be improved. It might be stated as several measurements and goals for each of them. Each of these measurements is an important process output that matters to the customer. All Six Sigma projects must identify who the customer is for each process. The customer sets the acceptable parameters for each of the key process outputs. These are usually expressed as specifications, but might take other forms. The key is that no project can proceed unless a quantifiable goal is stated. At the end of the Define stage we will have a well-defined project with well-defined and quantifiable goals.

The next stage is the Measure phase. At the output of this phase we should have a thorough understanding of how the process is behaving right now. By understanding we mean a quantitative description. This description will include current process variable averages, standard deviations, behavior over time, and histograms. In addition, we will know whether the process is stable or not. Another thing we will know at the end of this stage is whether the process is capable and what its capability value is. The Sigma performance level, for which Six Sigma is named, is an example of a capability measure. To do all of this we have to collect data, and to know what data to collect, we need to know which characteristics are critical. Sometimes this is determined in the Define phase, but sometimes what we are given in the Define phase is not sufficient for us to collect data. The first thing we would then do is determine the precise measurements necessary to determine process capability.

The third phase in a DMAIC cycle is the Analyze phase. The overall approach in Six Sigma problem solving is to carefully define a problem, find the root causes for the problem, and then attack the root causes. The purpose of the Analyze phase is to correctly identify what those root causes are and prove it with data. In this disciplined problem-solving method, opinions are not worth much. We might have historical data that we can analyze using some basic statistics or we might have to conduct some experiments and collect some new data.

In the Improve phase, we come up with solutions. This phase also often starts with a brainstorming session. But now we know what the root causes are. A root cause is something that is controllable and has a direct effect on the characteristic we are trying to improve. This problem-solving session again can be short and simple or long and complex, depending on the process. At the end of this phase, the process improvement is installed and the process is now running per goals. Some Six Sigma programs split this phase into two, where the improve stage involves figuring out what the solution is and the implement stage actually implements the solution. This is because different skills are needed to solve problems (improve) and to build (implement).

The final DMAIC phase is Control. This is unique to the Six Sigma approach. Most managers have heard of the Hawthorne effect. The Hawthorne effect is a temporary change of behavior or performance in response to a change in the environmental conditions, with the response being typically an improvement. The term was coined in 1955 by Henry A. Landsberger. Landsberger defined the Hawthorne effect as a short-term improvement caused by observing worker performance. In other words, it's not what we did that improved performance, it's the fact that we paid attention to it that caused everyone around us to behave better than normal. Even if the Hawthorne effect isn't operating on a project, because the problems are more technical and equipment oriented and less reliant on people's behavior, there is still a tendency for systems that have undergone an improvement to degrade back to where they were. In this phase the team makes sure that appropriate things are done so that the process will continue to perform at its new level. This can include administrative things like making sure that training materials or written procedures are modified, making sure that new specifications are transmitted and understood by everyone who needs them, including suppliers, and making sure that critical control points are monitored and that procedures exist to react quickly when something goes out of balance.
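
Checking that a process still performs "at its new level" presupposes a number to track. The measure this paper uses is the Sigma performance level derived from DPMO (Table I). The following is a minimal sketch of that conversion. It assumes the conventional 1.5 sigma shift, which the paper does not state but which reproduces the values tabulated in Table I. The function names and the sample defect counts are hypothetical.

```python
from statistics import NormalDist


def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities for a batch of inspected units."""
    return defects / (units * opportunities_per_unit) * 1_000_000


def sigma_level(dpmo_value, shift=1.5):
    """Convert DPMO to a Sigma performance level.

    The 1.5 sigma shift is an assumed convention, not taken from the paper.
    A DPMO of exactly 0 or 1 000 000 is outside the domain of inv_cdf
    and raises statistics.StatisticsError.
    """
    yield_fraction = 1.0 - dpmo_value / 1_000_000
    return NormalDist().inv_cdf(yield_fraction) + shift


# Hypothetical batch: 17 defects in 500 units with 10 opportunities each.
batch_dpmo = dpmo(defects=17, units=500, opportunities_per_unit=10)
print(f"batch: {batch_dpmo:.0f} DPMO -> {sigma_level(batch_dpmo):.2f} sigma")

# DPMO values as listed in Table I.
for value in (691462, 308538, 66807, 6210, 233, 3.4):
    print(f"{value:>9} DPMO -> {sigma_level(value):.2f} sigma")
```

Run periodically on fresh defect counts, a drop in the computed level would be one signal that the improved process is degrading back toward its earlier state.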

Fig. 1 shows the DMAIC phases in Six Sigma methodology. [2]

Figure 1. DMAIC cycle (in BPMN notation)

Defects per Million Opportunities (DPMO) is a major metric in Six Sigma methodology. DPMO is a measure of process performance: the average number of defects per unit observed during an average production process, divided by the number of opportunities to make a defect on the product under study during that process, normalized to one million. Tab. I shows the Six Sigma performance levels with the DPMO and the corresponding yield and defects of the process. [3]

TABLE I. SIX SIGMA PERFORMANCE LEVEL

Sigma performance level   DPMO      Percent defective   Percentage yield
1                         691 462   69%                 31%
2                         308 538   31%                 69%
3                         66 807    6,7%                93,3%
4                         6 210     0,62%               99,38%
5                         233       0,023%              99,977%
6                         3,4       0,00034%            99,99966%

We can use many Data Mining methods in the Six Sigma methodology. Their implementation can improve the results of the process by increasing the yield and decreasing the DPMO value. This has a direct impact on the Sigma performance level. The aim is to achieve the highest possible value of the Sigma performance level.

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. [4]

Several Data Mining process models exist. Nowadays Data Mining processes are based on the "CRoss-Industry Standard Process for Data Mining" (CRISP-DM). This model consists of six phases intended as a cyclical process.

Business Understanding includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan.

Data Understanding considers data requirements. This
step can include initial data collection, data description, data
exploration, and the verification of data quality. Data
exploration such as viewing summary statistics (which
includes the visual display of categorical variables) can occur
at the end of this phase. Models such as cluster analysis can
also be applied during this phase, with the intent of
identifying patterns in the data.

Data Preparation prepares the data. Once the available data resources are identified, they need to be selected, cleaned, built into the form desired, and formatted. Data cleaning and data transformation in preparation for data modeling need to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding.

Modeling. Data Mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Once greater data understanding is gained (often through pattern recognition triggered by viewing model output), more detailed models appropriate to the data type can be applied. The division of data into training and test sets is also needed for modeling.

Evaluation of the results. Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). This will lead to the identification of other needs (often through pattern recognition), frequently reverting to prior phases of CRISP-DM. Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations.

Deployment of the model. Data mining can be used both to verify previously held hypotheses and for knowledge discovery (identification of unexpected and useful relationships). Through the knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be obtained that may then be applied to business operations for many purposes, including prediction or identification of key situations. These models need to be monitored for changes in operating conditions, because what might be true today may not be true a year from now. If significant changes do occur, the model should be redone. It's also wise to record the results of data mining projects so documented evidence is available for future studies. [5]

Fig. 2 shows the CRISP-DM process.

Figure 2. CRISP-DM process

The use of Data Mining methods in Six Sigma methodology requires joining the CRISP-DM process to the DMAIC phases.

II. MARKET BASKET ANALYSIS

Market basket analysis has the objective of identifying products, or groups of products, which tend to occur together (are associated) in buying transactions (baskets). The knowledge obtained from a market basket analysis can be very valuable; for instance, it can be employed by a supermarket to reorganize its layout, taking products frequently sold together and locating them in close proximity. But it can also be used to improve the efficiency of a promotional campaign: products that are associated should not be put on promotion at the same time. By promoting just one of the associated products, it should be possible to increase the sales of that product and get accompanying sales increases for the associated products.

The databases usually considered in a market basket analysis consist of all the transactions made in a certain sale period (e.g. one year) and in certain sale locations (e.g. a chain of supermarkets). Consumers can appear more than once in the database. In fact, consumers will appear in the database whenever they carry out a transaction at a sales location. The objective of the analysis is to find the most frequent combinations of products bought by the customers. The association rules described in Section 4.8 of [1] represent the most natural methodology here; indeed they were actually developed for this purpose. Analyzing the combinations of products bought by the customers, and the number of times these combinations are repeated, leads to a rule of the type 'if condition, then result' with a corresponding interestingness measurement. Each rule of this type describes a particular local pattern. The set of association rules can be easily interpreted and communicated. Possible disadvantages are locality and lack of probability modeling.

In the shop, the recorded transactions are all the transactions made by someone holding one of the chain's loyalty cards. Each card carries a code that identifies features about the owner, including important personal characteristics such as sex, date of birth, partner's date of birth, number of children, profession and education. The card allows the analyst to follow the buying behavior of its owner: how many times they go to the supermarket in a given period, what they buy, whether they follow the promotions, etc. [1]

Similar analyses can be found in [1].

III. OUR RESEARCH

Our research focuses on implementing Data Mining methods in the phases of the Six Sigma methodology. We decided to implement market basket analysis in the Improve phase, because this allows us to predict customer behavior. The prediction can determine the main products of the manufacturing process and profile the customer groups.

We suggest creating a Data Warehouse, because the integrity of the data from the process is variable. [6]

To preserve the anonymity of the companies, we use generic product labels.
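
The 'if condition, then result' rules of Section II are usually scored with support and confidence, the two measures that Table II reports. The sketch below shows how both can be computed from raw baskets. It is not the GRI algorithm used in this research, only the arithmetic behind the two columns. The toy baskets and the function names are hypothetical.

```python
def support(baskets, items):
    """Share of baskets that contain every product in `items`."""
    items = frozenset(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    matching_antecedent = sum(antecedent <= basket for basket in baskets)
    if matching_antecedent == 0:
        return 0.0  # the rule never fires, so no confidence can be measured
    matching_both = sum(both <= basket for basket in baskets)
    return matching_both / matching_antecedent


# Hypothetical baskets, one set of product labels per transaction.
baskets = [
    {"D", "F", "G"},
    {"D", "F", "G"},
    {"D", "F"},
    {"A", "J"},
    {"A", "G", "J"},
]

# Rule: if a basket contains D and F, then it also contains G.
print(f"antecedent support: {support(baskets, {'D', 'F'}):.2%}")
print(f"rule support:       {support(baskets, {'D', 'F', 'G'}):.2%}")
print(f"confidence:         {confidence(baskets, {'D', 'F'}, {'G'}):.2%}")
```

Tools differ on what they label "support": some report the share of baskets matching the antecedent only, others the share matching the whole rule. The paper does not say which convention Table II follows, so the sketch prints both.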


Our offer consists of ten products. We store basket contents, a basket summary and personal information. To protect personal data we use only fictitious loyalty card numbers.

The first step was to obtain an overall picture of the associations between products in the basket. We used Generalized Rule Induction (GRI) to produce association rules. The dataset contains 634 records. Tab. II shows the generated association rules between products. The data in the table are sorted by confidence.

TABLE II. ASSOCIATION RULES (A PART)

Consequent   Antecedent                        Support %   Confidence %
product G    product B, product D, product F   3,47        95,45
product D    product B, product F, product G   3,47        95,45
product D    product A, product F, product G   4,42        92,86
product F    product C, product D, product G   2,21        92,86
product D    product E, product F, product G   3,79        91,67
product F    product B, product D, product G   3,63        91,3
product F    product D, product G              16,56       88,57
product G    product A, product D, product F   4,73        86,67
product F    product A, product D, product G   4,73        86,67
product G    product C, product D, product F   2,37        86,67
product G    product D, product F              17,03       86,11
product D    product F, product G              17,03       86,11
product D    product E, product G              5,68        72,22

These (two-way) association rules show associations among product G, product D and product F.

The next step was a graphical view of the associated products. We used a Web plot to show the dependence between the products. The results are shown in Fig. 3.

Figure 3. Web plot of associations

Bold lines in this plot show the groups of customers suggested by the GRI model. In the resulting display, two groups of customers stand out:
• those who buy product D, product F and product G (group A),
• those who buy product A and product J (group B).

In the third step we profiled the customer groups. We needed to know who these customers are. This could be achieved by tagging each customer with a flag for each of these groups. To build rule-based profiles of these flags we used rule induction (C5.0). As a result we obtained the following rules:

Generated rule for group A:
Rule 1 for T (true)
    if income <= 16 900
    and sex=M
    then T

Generated rules for group B:
Rule 1 for T
    if age <= 19
    and income > 26 800
    then T
Rule 2 for T
    if age > 22
    and age <= 24
    and income <= 20 900
    and payment_method=CASH
    then T
Rule 3 for T
    if age > 16
    and age <= 22
    and income > 12 200
    and income <= 15 500
    and payment_method=CARD
    and sex=F
    then T
Rule 4 for T
    if age <= 24
    and income > 26 800
    and payment_method=CHEQUE
    then T

Rule 5 for T
    if age > 17
    and age <= 24
    and income > 17 700
    and income <= 26 800
    and payment_method=CARD
    and value > 28,741
    and value <= 49,158
    then T
Rule 6 for T
    if age <= 16
    then T
Rule 7 for T
    if age <= 24
    and income > 17 700
    and income <= 26 800
    and payment_method=CARD
    and value > 28,741
    then T

Fig. 4 shows the model built in IBM SPSS Modeler.

[Stream nodes: Data, GRI, Web plot, group A, group B]
Figure 4. Model built with GRI, Web plot and C5.0 algorithm

With this analysis we profiled the products in the market basket. This implementation of Data Mining methods in the Improve phase of the Six Sigma methodology might be used to target special offers. These special offers might improve the Six Sigma performance level (indirectly), because we can spend money on targeting a specific customer group.

Each implementation of Data Mining methods in the Six Sigma methodology should be evaluated. [8]

REFERENCES

[1] P. Giudici, S. Figini, "Applied Data Mining for Business and Industry. Second Edition". John Wiley & Sons Ltd; 2009. ISBN 978-0-470-05886-2
[2] W. Bentley, P. T. Davis, "Lean Six Sigma: Secrets for the CEO". CRC Press; 2010. ISBN 978-1-4398-0379-0
[3] C. Gygi, N. DeCarlo, B. Williams, "Six Sigma for Dummies". Wiley Publishing, Inc.; 2005. ISBN 0-7645-6798-5
[4] D. Larose, "Discovering Knowledge in Data: An Introduction to Data Mining". John Wiley; 2005. ISBN 0-471-66657-2
[5] D. Olson, D. Dursun, "Advanced Data Mining Techniques". Springer; 2008. ISBN 978-3-540-76916-3
[6] R. Halenar, "Loading data into data warehouse and their testing - Zavadzanie Udajov do datoveho skladu a ich testovanie". In: Journal of Information Technologies, vol. 2 (2009), pp. 7-14. ISSN 1337-7469.
[7] M. Kebisek, M. Elias, "The possibility of utilization of knowledge discovery in databases in the industry". In: Annals of MTeM for 2009 & Proceedings of the 9th International Conference Modern Technologies in Manufacturing; 2009 October 8-10, Cluj-Napoca, Romania. Cluj-Napoca: Technical University of Cluj-Napoca, 2009. pp. 139-142. ISBN 973-7937-07-04.
[8] J. Zeman, P. Tanuska, M. Kebisek, "The Utilization of Metrics Usability To Evaluate The Software Quality". In: ICCTD 2009 International Conference on Computer Technology and Development. 13-15 November 2009, Kota Kinabalu, Malaysia. IEEE Computer Society, 2009. ISBN 978-0-7695-3892-1
