DWM Mini Project

Manjra Charitable Trust's

RAJIV GANDHI INSTITUTE OF TECHNOLOGY


JUHU VERSOVA LINK ROAD, ANDHERI (W), MUMBAI 400 053

Introduction:

TANAGRA is free DATA MINING software for academic and research purposes. It
offers several data mining methods from the areas of exploratory data analysis, statistical
learning, machine learning and databases.

This project is the successor of SIPINA, which implements various supervised
learning algorithms, especially the interactive and visual construction of decision trees.
TANAGRA is more powerful: it contains some supervised learning but also other paradigms
such as clustering, factorial analysis, parametric and nonparametric statistics, association rules,
and feature selection and construction algorithms.

TANAGRA is an "open source project": every researcher can access the source
code and add their own algorithms, provided they agree and conform to the software
distribution license.

The main purpose of the Tanagra project is to give researchers and students easy-to-use
data mining software that conforms to the present norms of software development in
this domain (especially in the design of its GUI and the way it is used), and that allows them
to analyze either real or synthetic data.

The second purpose of TANAGRA is to offer researchers an architecture that
allows them to easily add their own data mining methods and to compare their performance.
TANAGRA acts more as an experimental platform that lets them concentrate on the essentials of
their work by sparing them the unpleasant part of programming this
kind of tool: the data management.

The third and last purpose, aimed at novice developers, is to spread a
possible methodology for building this kind of software. They can take advantage of free
access to the source code to see how this sort of software is built, the problems to avoid, the
main steps of the project, and which tools and code libraries to use. In this way, Tanagra
can be considered a pedagogical tool for learning programming techniques.

TANAGRA does not presently include what makes up the strength of
commercial software in this domain: a wide set of data sources, direct access to data
warehouses and databases, data cleansing, interactive use, and so on.

DEPARTMENT: INFORMATION TECHNOLOGY Page 1


Import dataset into Tanagra:

1. Choose “File/New…” in the main menu of TANAGRA

2. Enter a title for the diagram: « TANAGRA : Importing Data »


3. Enter the name of the associated file in which you will save your work
(« TANAGRA_ImportingData.bdm »).
4. Before clicking the Save button, browse the hard disk to the
directory « …\TANAGRA\Tutorials ».
5. Click the open-file icon and select the file you have created, “weather.txt”.

6. Click OK to start importing the data.

A new diagram is created, based on the file « weather.txt ». You can see the
description of its contents in the right frame.

This project was undertaken for the subject Data Warehouse and Mining and Business
Intelligence. It is a tool-based project: we use the Tanagra tool, and the database is
a weather report. The project examines the attributes that describe the weather,
such as temperature, humidity and windy, which gives a brief picture
of the weather of the area. Using visualization, regression, association and k-means
techniques in Tanagra, we derive observations and conclusions about the database. To view
the data in graphical form we use a scatter plot. In this way the Tanagra tool gives us an
overview of the database.

Database details:

The database used in this mini project holds weather observations. Each record
consists of several fields. The database is available as an Excel
document containing 15 weather records.

Tanagra loads data from tab-separated text files, built in the following way:
- 1st line: names of the attributes
- Next lines: values of the attributes for the sample (one line for each record).
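
The layout described above can be checked before importing with a small script. The following is a minimal sketch in Python, not part of Tanagra; the function name load_tab_separated and the two sample rows are assumptions made for illustration and are not taken from the project's weather.txt.

```python
import csv
import io


def load_tab_separated(text):
    """Parse tab-separated text: first line = attribute names, other lines = records."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if row]  # ignore completely blank lines
    header, records = rows[0], rows[1:]
    for number, record in enumerate(records, start=1):
        if len(record) != len(header):
            raise ValueError(
                f"record {number}: expected {len(header)} values, got {len(record)}"
            )
    # One dictionary per record, keyed by attribute name.
    return header, [dict(zip(header, record)) for record in records]


# Hypothetical sample rows for illustration only; not the project's data.
SAMPLE = (
    "Outlook\tTemp\tHumidity\tWindy\tClass\n"
    "sunny\t75\t70\tyes\tplay\n"
    "rain\t71\t80\tyes\tdontplay\n"
)

header, records = load_tab_separated(SAMPLE)
print(header)
print(records[0])
```

A record with a missing or extra value raises an error here, which is usually easier to diagnose than a failed import in the tool.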

This text file (Dataset) includes the following attributes:


1. Outlook
2. Temp
3. Humidity
4. Windy
5. Class

The dataset contains two continuous attributes (Temp and Humidity) and three discrete attributes (Outlook, Windy and Class).

The discrete attributes take the following values:

Outlook = “sunny”, “overcast”, “rain”.
Windy = “yes”, “no”.
Class = “play”, “dontplay”.
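
The split into continuous and discrete attributes can be illustrated with a short Python sketch. The rule used here (an attribute is continuous when every value parses as a number) is an assumption for illustration, not Tanagra's documented behaviour, and the three rows are hypothetical values rather than the project's data.

```python
def is_number(value):
    """Return True when the text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def attribute_types(header, rows):
    """Label an attribute 'continuous' if all its values are numeric, else 'discrete'."""
    types = {}
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        types[name] = "continuous" if all(is_number(v) for v in column) else "discrete"
    return types


# Hypothetical rows for illustration only; not the project's data.
HEADER = ["Outlook", "Temp", "Humidity", "Windy", "Class"]
ROWS = [
    ["sunny", "75", "70", "yes", "play"],
    ["overcast", "72", "90", "yes", "play"],
    ["rain", "71", "80", "yes", "dontplay"],
]

types = attribute_types(HEADER, ROWS)
print(types)
```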

This project analyzes the above database in terms of:

1. Scatter plot with label
2. Clustering
3. Association
4. Regression tree

Scatter plot with label:-

Problem statement:
The dataset provides information about the weather. This dataset is plotted as a
scatter plot with label. The features used to plot the scatter graph are
1. Temp
2. Humidity
The scatter plot with label is a tool that provides a graphical view of all this
information.

Steps for creating the scatter plot with label:

1. Load the dataset (weather.txt) into Tanagra.
2. Open the Data visualization tab in the component bar.
3. Select the Scatter plot with label option from the visualization tab.
4. Drag this option onto the dataset and open it.
5. The output appears in the right frame.
6. It contains the scatter plot for the chosen features.

By using data visualization we have derived the scatter plot of the attributes humidity and
temperature.
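
To show what a scatter plot with label conveys, the sketch below renders labelled points on a character grid in Python. Tanagra draws the plot graphically; this is only a rough text approximation. The function name text_scatter, the choice of Temp on the horizontal axis and Humidity on the vertical axis, and the four points are all assumptions made for illustration.

```python
def text_scatter(points, width=40, height=12):
    """Render (x, y, label) points on a character grid.

    Each point is drawn as the first letter of its label. If two points
    fall on the same cell, the later one overwrites the earlier one.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_span = (x_max - x_min) or 1  # avoid division by zero when all x are equal
    y_span = (y_max - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y, label in points:
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y_max - y) / y_span * (height - 1))  # larger y is drawn higher up
        grid[row][col] = label[0]
    return "\n".join("".join(line).rstrip() for line in grid)


# Hypothetical (Temp, Humidity, Class) triples; not the project's data.
POINTS = [(75, 70, "play"), (80, 90, "dontplay"), (64, 65, "play"), (71, 80, "dontplay")]

plot = text_scatter(POINTS)
print(plot)
```

Points marked p and d that sit in separate regions of the grid suggest that the two plotted attributes help to separate the classes.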

Clustering:-

Problem statement:
The dataset provides information on several characteristics and features.
Clustering is the task of segmenting a diverse group into a number of more similar subgroups,
or clusters. Here the clustering is done on the attributes Temp and Humidity.

Steps for creating the clustering (k-means):

1. Add a Define Status operator under the “Dataset” node by clicking its icon in the
shortcuts toolbar. A dialog box appears automatically, in which you define
the status of the attributes.
2. First, make sure that the active tab in the dialog is the “Input” one. Then select the
continuous attributes in the left list by clicking the corresponding button below the list
(as shown in the following screenshot), and click the arrow button to move them to the
Input list.

3. Select the two continuous attributes as input values.
4. You have now defined the descriptors. Click OK to validate and close this
dialog box.
5. Drag the K-Means option onto Define Status 1, for which the descriptors were defined.
6. Right-click the K-Means 1 option and select View.
7. The output appears in the right frame.
8. It contains the clustering for the chosen features.

By performing the k-means clustering operation we have grouped the data into a fixed
number of distinct, more manageable clusters.
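
The idea behind k-means can be sketched in a few lines of Python. This is a plain version of Lloyd's algorithm under stated assumptions: the first k points are used as the starting centroids and the values are not standardized. Tanagra's own component may initialize and scale the data differently. The six (Temp, Humidity) pairs are hypothetical and not the project's data.

```python
import math


def k_means(points, k, iterations=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic start: first k points
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical (Temp, Humidity) pairs; not the project's data.
POINTS = [(64, 65), (85, 90), (66, 68), (83, 88), (65, 70), (86, 92)]

centroids, assignment = k_means(POINTS, k=2)
print(assignment)
print(centroids)
```

The number of clusters k is fixed in advance, which is why the result is a fixed number of groups.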

Association:-

Problem statement:
The dataset provides information on several characteristics and features.
Association analysis is used to find relationships in a database. Here the relationship is
shown between three attributes: outlook, windy and class.

Steps for creating the association:

1. Add a Define Status operator under the “Dataset” node by clicking its icon in the
shortcuts toolbar. A dialog box appears automatically, in which you define
the status of the attributes.
2. First, make sure that the active tab in the dialog is the “Input” one. Then select the
continuous attributes in the left list by clicking the corresponding button below the list
(as shown in the following screenshot), and click the arrow button to move them to the
Input list.
3. Select one continuous attribute, temp, as the input value.
4. Select one continuous attribute, humidity, as the target value.
5. In the same dialog box, activate the Target tab. Select the « class » attribute in the list
and click the arrow button.
6. You have now defined the class attribute (« class » = Target) and the descriptors
(the others = Input).
7. Click OK to validate and close this dialog box.

8. Drag Apriori onto Define Status and view the output, which appears in the right frame.
9. It contains the association for the chosen features.

By performing association we have managed to show distinct links between the attributes.
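
The strength of such a link is usually measured by support and confidence, the two quantities an Apriori-style search filters on. The Python sketch below computes them for a single candidate rule; it does not perform the full Apriori search. The function names and the four records are assumptions made for illustration and are not the project's data or Tanagra's output.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) divided by support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical records; not the project's data.
RECORDS = [
    {"Outlook": "sunny", "Windy": "no", "Class": "play"},
    {"Outlook": "sunny", "Windy": "yes", "Class": "dontplay"},
    {"Outlook": "overcast", "Windy": "no", "Class": "play"},
    {"Outlook": "rain", "Windy": "yes", "Class": "dontplay"},
]
# Each record becomes a set of "attribute=value" items.
TRANSACTIONS = [{f"{k}={v}" for k, v in r.items()} for r in RECORDS]

# Candidate rule: Windy=no -> Class=play
rule_support = support(TRANSACTIONS, {"Windy=no", "Class=play"})
rule_confidence = confidence(TRANSACTIONS, {"Windy=no"}, {"Class=play"})
print(rule_support, rule_confidence)
```

In these made-up records the rule holds in half of the rows (support 0.5) and in every row where Windy is no (confidence 1.0).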

Regression Tree:-

Problem statement:
The dataset contains information on the various attributes. We
use a regression tree to find the relationship between the variables temp and humidity.

Steps for creating the regression tree:

1. Add a Define Status operator under the “Dataset” node by clicking its icon in the
shortcuts toolbar. A dialog box appears automatically, in which you define
the status of the attributes.
2. First, make sure that the active tab in the dialog is the “Input” one. Then select the
continuous attributes in the left list by clicking the corresponding button below the list
(as shown in the following screenshot), and click the arrow button to move them to the
Input list.
3. Select one continuous attribute, temp, as the input value.
4. Select one continuous attribute, humidity, as the target value.

5. In the same dialog box, activate the Target tab. Select the « class » attribute in the list
and click the arrow button.
6. You have now defined the class attribute (« class » = Target) and the descriptors
(the others = Input).
7. Click OK to validate and close this dialog box.
8. Drag Regression tree onto Define Status and view the output, which appears in the right frame.
9. It contains the regression tree for the chosen features.

By constructing the regression tree we have been able to show the relationship between temp and humidity.
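
A regression tree is grown by repeatedly choosing the split that makes the target values on each side as similar as possible. The Python sketch below finds only the first such split, using the sum of squared errors as the criterion; this is a common CART-style choice and is an assumption here, since the report does not describe the criterion or stopping rules of Tanagra's component. The temp and humidity values are hypothetical and not the project's data.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, error) for the split x <= threshold with the lowest total SSE of y."""
    best = None
    distinct = sorted(set(xs))
    # Candidate thresholds are the midpoints between neighbouring x values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[1]:
            best = (threshold, error)
    return best


# Hypothetical values; not the project's data.
TEMP = [64, 66, 70, 80, 85]
HUMIDITY = [65, 67, 66, 90, 92]

threshold, error = best_split(TEMP, HUMIDITY)
print(threshold, error)
```

Each side of the chosen split would then predict the mean humidity of its records, and a full tree repeats the same search inside each side.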

This was a mini project in DWMI using the Tanagra tool.

We have successfully completed the analysis of the above dataset. The dataset
contained weather details. Using Tanagra, we carried out the analysis
of the data with the tools provided.

The Scatter Plot with Label was carried out on the above dataset. The features
used to plot the scatter graph are the temperature and humidity of the weather. The
scatter plot is a tool that provides a graphical view of this information.

Clustering Analysis is used to segment a diverse group into a number
of more similar subgroups; here clustering is done on the attributes Temp and Humidity.

Association Analysis is used to find relationships in a database. The relationship has
been shown between three attributes: outlook, windy and class.

A regression tree was built on the data to find the relationship between the variables
temperature and humidity.

Hence, we have successfully completed the mini project.
