Data Mining
Group members (name, ID)
1. Abel Belete 0093/13
2. Iyassu Theodros 1405/13
3. Samson Workineh 2139/13
4. Tekalegn Tadewos 2365/13
Data Mining: Project 1
Introduction
What is data mining?
Data mining is the process of extracting useful information from large sets of data, that is, mining knowledge from data.
Data mining is also called Knowledge Discovery from Data (KDD).
What types of data can be mined?
Database data (RDBMS)
Data warehouse data
Transactional database data
Database data (RDBMS)
A relational database is a set of tables.
Each table stores data in rows and columns.
Columns represent attributes; rows represent records.
When mining databases, we can search for trends or data patterns.
Data warehouse
A collection of data integrated from different sources, organized to support querying and decision making.
Data is stored in a multidimensional structure (a data cube), where each dimension corresponds to an attribute.
[Figure: data cube]
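As a rough illustration of the data cube idea, the sketch below stores one aggregated cell per combination of dimension values and then rolls the cube up onto fewer dimensions. The sales records, the dimension names (product, region, quarter), and the roll_up helper are all made up for this example.

```python
from collections import defaultdict

# Hypothetical fact records: (product, region, quarter, units_sold)
sales = [
    ("laptop", "north", "Q1", 10),
    ("laptop", "south", "Q1", 4),
    ("phone", "north", "Q1", 7),
    ("phone", "north", "Q2", 5),
    ("laptop", "north", "Q2", 6),
]

# Base cube: one cell per (product, region, quarter) combination
cube = defaultdict(int)
for product, region, quarter, units in sales:
    cube[(product, region, quarter)] += units


def roll_up(cube, keep):
    """Aggregate the cube onto the dimension positions listed in keep."""
    result = defaultdict(int)
    for key, units in cube.items():
        result[tuple(key[i] for i in keep)] += units
    return dict(result)


by_product = roll_up(cube, keep=[0])            # totals per product
by_region_quarter = roll_up(cube, keep=[1, 2])  # totals per (region, quarter)
```

Each dimension of the cube is one attribute; rolling up simply sums over the dimensions that are dropped.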
Transactional database
Each record is called a transaction.
Examples of transactions:
customer sales
flight ticket bookings
user clicks on a web page
A transaction has:
Transaction ID
Name of transaction
Start time
End time
Transaction location
Transaction date
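The transaction fields listed above could be represented as a simple record type. This is only a sketch: the class name, field names, and the sample booking values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Transaction:
    """One record of a transactional database (illustrative field names)."""
    transaction_id: str
    name: str
    start_time: datetime
    end_time: datetime
    location: str
    transaction_date: date

    def duration_seconds(self) -> float:
        """Time between the start and the end of the transaction."""
        return (self.end_time - self.start_time).total_seconds()


# Hypothetical example: a flight ticket booking
booking = Transaction(
    transaction_id="T100",
    name="flight ticket booking",
    start_time=datetime(2024, 1, 15, 9, 30, 0),
    end_time=datetime(2024, 1, 15, 9, 32, 30),
    location="Addis Ababa",
    transaction_date=date(2024, 1, 15),
)
```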
Other types of data that can be mined:
Sequence data: stock market prices over time
Data streams: data that is continuously transmitted
Spatial data: maps
Engineering and design data: integrated circuits
Multimedia: audio, video
Web data: web page related data
Hypertext: linked text documents
Data (query) preprocessing
The process of transforming raw data into an understandable format.
Major tasks:
1) Data cleaning
2) Data integration
3) Data reduction
4) Data transformation
Data cleaning:
The process of removing incorrect, incomplete, or inaccurate data and filling in missing values.
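A minimal sketch of data cleaning, assuming missing values are marked with None and that ages outside 0 to 120 count as incorrect. The records, the valid range, and the choice of mean imputation are assumptions for illustration; other strategies (dropping the record, using the median) are equally possible.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value and -1 an invalid age
raw = [
    {"name": "A", "age": 25},
    {"name": "B", "age": None},
    {"name": "C", "age": -1},
    {"name": "D", "age": 35},
]


def clean_ages(records, low=0, high=120):
    """Replace missing or out-of-range ages with the mean of the valid ones."""
    valid = [r["age"] for r in records
             if r["age"] is not None and low <= r["age"] <= high]
    fill = mean(valid)
    cleaned = []
    for r in records:
        age = r["age"]
        if age is None or not (low <= age <= high):
            age = fill
        cleaned.append({**r, "age": age})
    return cleaned


cleaned = clean_ages(raw)
```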
Data integration:
Data from multiple heterogeneous sources is combined into a single dataset.
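The sketch below shows one way two sources with different schemas might be combined into a single dataset. Both sources, their field names, and the integrate function are hypothetical.

```python
# Hypothetical source 1: a CRM export keyed by "cust_id"
crm = [
    {"cust_id": 1, "full_name": "Abel"},
    {"cust_id": 2, "full_name": "Sara"},
]
# Hypothetical source 2: a billing system keyed by "customer"
billing = [
    {"customer": 1, "total": 120.0},
    {"customer": 3, "total": 40.0},
]


def integrate(crm_rows, billing_rows):
    """Map both schemas onto one and merge records that share an id."""
    merged = {}
    for row in crm_rows:
        merged[row["cust_id"]] = {
            "id": row["cust_id"], "name": row["full_name"], "total": None,
        }
    for row in billing_rows:
        record = merged.setdefault(
            row["customer"],
            {"id": row["customer"], "name": None, "total": None},
        )
        record["total"] = row["total"]
    return [merged[key] for key in sorted(merged)]


dataset = integrate(crm, billing)
```

Records that appear in only one source keep None for the fields the other source would have supplied, which is where data cleaning usually takes over.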
Two types of data integration:
tight coupling
loose coupling
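Tight coupling vs loose coupling
Tight coupling
Data from the different sources is extracted, transformed, and loaded into one physical location, such as a data warehouse, so the integrated store depends closely on its sources.
Tight coupling reduces flexibility, because a change in one source affects the integrated store.
Loose coupling
The data stays in the source databases; an interface translates each user query and sends it directly to the sources, which reduces the dependencies between them.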
Tight coupling             Loose coupling
More interdependency       Less interdependency
More coordination          Less coordination
More information flow      Less information flow
Less testability           More testability
Data reduction:
The volume of data is reduced to make analysis easier.
Methods for data reduction:
1. Dimensionality reduction:
Reduces the number of input variables in the dataset.
Too many input variables lead to poor performance.
2. Data cube aggregation:
Data is combined to construct a data cube.
Redundant and noisy data is removed.
3. Attribute subset selection:
Only highly relevant attributes are kept; the others are removed.
4. Numerosity reduction:
Only a model or a sample of the data is stored instead of the entire data.
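Two of the reduction methods above can be sketched in a few lines. Here attribute subset selection is approximated very crudely by dropping attributes that never vary (a stand-in for a real relevance measure), and numerosity reduction by keeping a random sample of rows. The dataset, the threshold, and the function names are assumptions.

```python
import random
from statistics import pvariance

# Hypothetical dataset: "country_code" is constant and carries no information
rows = [
    {"height": 150 + i, "weight": 50 + 2 * i, "country_code": 1}
    for i in range(100)
]


def select_attributes(rows, min_variance=1e-9):
    """Keep only attributes whose values actually vary across the rows."""
    names = rows[0].keys()
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


def sample_rows(rows, k, seed=0):
    """Numerosity reduction: keep a simple random sample of k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


kept = select_attributes(rows)
reduced = [{n: r[n] for n in kept} for r in sample_rows(rows, k=10)]
```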
Data transformation
Data is transformed into a form suitable for the mining process.
Methods:
1. Normalization
2. Attribute selection
3. Discretization
4. Concept hierarchy generation
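Two of these methods are easy to show on a small list of numbers. The sketch assumes min-max scaling as the normalization method and equal-width bins as the discretization method; both are common choices, not the only ones, and the function names and sample ages are made up.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def discretize_equal_width(values, bins):
    """Replace each value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin index `bins`, so clamp it
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [20, 30, 40, 60]
normalized = min_max_normalize(ages)
age_bins = discretize_equal_width(ages, bins=2)
```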
Data mining functionalities:
1. Class description:
Data is associated with classes or concepts.
Descriptions can be produced in two ways:
Data characterization:
Summarizes the general features of the target class or concept.
Output -> general overview
Data discrimination:
Compares the features of the target class with those of contrasting classes.
Output -> bar charts, curves, graphs
2. Mining frequent patterns, associations and correlations
Frequent patterns:
Patterns that occur most often in the data:
frequent itemsets (sets of data items)
frequent subsequences
frequent substructures
Association analysis
A way of identifying relationships among items.
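Frequent itemsets and associations are usually measured with support (the fraction of transactions containing an itemset) and confidence (how often the consequent appears when the antecedent does). The sketch below enumerates candidates by brute force, which is fine for a toy example but is not the Apriori algorithm; the transactions and function names are invented.

```python
from itertools import combinations

# Hypothetical transactions, each a set of purchased items
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets whose support meets min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


frequent = frequent_itemsets(transactions, min_support=0.5)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
```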
Correlation analysis
A mathematical method that shows how strongly a pair of attributes is related.
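One common way to quantify how strongly two numeric attributes are related is the Pearson correlation coefficient, which ranges from -1 to 1. The slides do not name a specific measure, so Pearson is an assumption here, and the sample data is made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attributes: hours studied and exam score
hours = [1, 2, 3, 4, 5]
score = [52, 54, 61, 64, 70]
r = pearson(hours, score)  # close to 1: strong positive relationship
```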
3. Classification and regression for predictive analysis
Classification:
The process of finding a model that distinguishes classes of data items, so that the class of a new item can be predicted.
A decision tree is one model commonly used for classification.
Regression: a statistical methodology used to predict missing or unknown numeric values.
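The difference between the two tasks can be sketched with very small models: a least-squares line for numeric prediction, and a one-level decision tree (a "stump") for classification. Both are deliberately simplified, and the data, function names, and single-attribute setup are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def fit_stump(xs, labels):
    """One-level decision tree: pick the threshold with the fewest errors."""
    best = None
    for threshold in sorted(set(xs)):
        for below, above in ((0, 1), (1, 0)):
            errors = sum((below if x <= threshold else above) != y
                         for x, y in zip(xs, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda x: below if x <= threshold else above


# Regression: predict a numeric value (hypothetical house sizes and prices)
sizes = [50, 60, 80, 100]
prices = [150, 180, 240, 300]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 70 + intercept

# Classification: predict a class label (hypothetical loan approvals)
incomes = [10, 20, 30, 40, 50, 60]
approved = [0, 0, 0, 1, 1, 1]
classify = fit_stump(incomes, approved)
```

Calling classify(25) returns class 0 and classify(45) returns class 1, since the learned threshold separates incomes up to 30 from the rest.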
4. Cluster analysis
The data items are clustered based on the
principle of maximising the intraclass similarity
and minimising the interclass similarity
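One widely used clustering method is k-means, sketched here for one-dimensional data. The slides do not name an algorithm, so k-means is an assumption, as are the sample values and starting centroids.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Basic k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(values, centroids=[1, 12])
```

Items inside each resulting cluster are close to each other, while the two clusters are far apart, which is the similarity principle described above.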
5. Outlier analysis (anomaly mining)
Among the data items in a database, there may be some that do not follow the general behaviour of the data.
Such items are sometimes called:
noise
exceptions
outliers
Finding them is called outlier detection.
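A simple way to flag items that do not follow the general behaviour is the z-score: how many standard deviations a value lies from the mean. The threshold of 2 and the sample readings are assumptions, and z-scores are only one of several possible detection methods.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one abnormal value
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
outliers = z_score_outliers(readings)
```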
Data Mining Applications
Market-basket analysis
Market basket analysis is a data
mining technique used by retailers to
increase sales by better understanding
customer purchasing patterns.
It involves analyzing large data sets, such
as purchase history, to reveal product
groupings, as well as products that are
likely to be purchased together.
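As a minimal sketch of the idea, the code below counts how often each pair of products appears in the same basket and reports the most common pair. The baskets are invented, and a real analysis would normally also compute measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: one list of products per basket
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]

# Count every unordered pair of distinct products bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
```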
Interestingness of patterns
A data mining system can generate millions of patterns every day.
Among all these patterns, how many are really interesting?
Only a small fraction of the generated patterns would be interesting to any given user.
What makes a pattern interesting?
A pattern is interesting if it is:
easily understood by humans
valid on new or test data
potentially useful
Can a data mining system generate all of the interesting patterns?
This question refers to the completeness of the data mining system.
In reality, it is not possible for a data mining system to generate all interesting patterns.
Classification of data mining
Bayesian classification
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities, that is, the probability that a given item belongs to a particular class.
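The probability that an item belongs to a class follows from Bayes' theorem. The sketch below works through a two-class case with made-up numbers for an email example; the function name and all probabilities are assumptions.

```python
def posterior(prior, likelihood, likelihood_other):
    """P(class | evidence) for a two-class problem via Bayes' theorem.

    prior            -- P(class)
    likelihood       -- P(evidence | class)
    likelihood_other -- P(evidence | not class)
    """
    numerator = likelihood * prior
    denominator = numerator + likelihood_other * (1 - prior)
    return numerator / denominator


# Hypothetical numbers: 20% of emails are spam; the word "offer" appears
# in 60% of spam emails and in 5% of non-spam emails.
p_spam_given_offer = posterior(prior=0.2, likelihood=0.6, likelihood_other=0.05)
```

With these numbers the numerator is 0.12 and the denominator is 0.12 + 0.04, so an email containing "offer" would be spam with probability 0.75.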
Classification based on the application adapted:
Finance
Telecommunications
DNA
Stock markets
Email
General classifications
Conclusion:
Data mining is a large area of data science that aims to discover patterns and features in data, often in large data sets. It includes regression, classification, clustering, anomaly detection, and other tasks. It also includes preprocessing, validation, summarization, and ultimately making sense of the data sets. It primarily turns raw data into useful information.