UNIT - 3 Final

Data Mining Techniques
UNIT - III
Syllabus :
UNIT III: (14 Hrs) WEB MINING: Introduction - Web content
mining - Web usage mining - Text mining - Unstructured text -
Episode rule discovery for text - Hierarchy of categories - Text
clustering
What is Web Mining?
Web mining can broadly be seen as the application of adapted data mining techniques to the
web, whereas data mining is defined as the application of algorithms to discover patterns in
mostly structured data, embedded in a knowledge discovery process
Web mining is distinctive in that it has to deal with a variety of data types
The web has multiple aspects that yield different approaches to the mining process:
web pages consist of text, web pages are linked via hyperlinks, and user activity can be
monitored via web server logs
These three features lead to the differentiation between three areas: web content mining,
web structure mining, and web usage mining
Applications of Web Mining :
Web mining helps to improve the power of web search engines by classifying web
documents and identifying web pages
It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g.,
FatLens, Become)
Web mining is used to predict user behavior
Web mining is very useful for a particular website or e-service, e.g., landing page
optimization
Challenges in Web Mining :
The web poses great challenges for resource and knowledge discovery, based on the
following observations −
The web is too huge − The web is very large and growing rapidly,
which makes it seem too large for data warehousing and data mining
Complexity of web pages − Web pages do not have a unifying structure.
They are very complex compared to traditional text documents.
The digital library of the web holds a huge number of documents, and these are not
arranged in any particular sorted order
Web is a dynamic information source − The information on the web is rapidly updated. Data such
as news, stock markets, weather, sports, and shopping are regularly updated
Diversity of user communities − The user community on the web is rapidly expanding. These users
have different backgrounds, interests, and usage purposes.
More than 100 million workstations are connected to the Internet, and the number is still rapidly
increasing
Relevancy of information − It is considered that a particular person is generally interested in only
a small portion of the web, while the rest of the web contains information that is not
relevant to the user and may swamp the desired results
Purpose of Web Mining :
Finding Relevant Information:
We either browse or use a search service when we want to find specific information on the web
We specify a simple keyword query, and the response from a web search engine is a list of pages ranked by
their similarity to the query
Today’s search tools have the following problems:
Low precision: This is due to the irrelevance of many of the search results. We may get many pages
that are not relevant to our query
Low recall: This is due to the inability to index all the information available on the web. Because some
pages are not properly indexed, we may not get those pages through any search engine
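The two measures above can be made concrete with a small calculation. The following is a minimal sketch, assuming the standard information-retrieval definitions (precision = relevant retrieved / all retrieved, recall = relevant retrieved / all relevant); the page identifiers are hypothetical and only serve as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result list of a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant pages that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers, for illustration only.
retrieved_pages = ["p1", "p2", "p3", "p4"]       # what the search engine returned
relevant_pages = ["p2", "p4", "p7", "p8", "p9"]  # what the user actually needed

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here half of the returned pages are relevant (low precision), and three relevant pages were never returned, for example because they were not indexed (low recall).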
Discovering New knowledge from the Web:
We can term the above problem a query-triggered process
On the other hand, we can have a data-triggered process, which presumes that we already have a
collection of web data and want to extract potentially useful knowledge from it
Personalized web page synthesis:
We may wish to synthesize a web page for an individual from the available set of web pages
Individuals may have their own preferences for the style of content and presentation while
interacting with the web
Information providers would like to create a system that responds to user queries by potentially
aggregating information from several sources, in a manner that depends on the user
Learning about Individual Users:
It is about knowing what the customers do and want
Web mining techniques provide a direct solution to these problems
In terms of data mining, web mining can be said to have three operations of interest:
Clustering
Associations
Sequential Analysis
Types of Web Mining:
Web mining techniques can be broadly classified into three groups:
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Mining Tasks:
Web Content Mining:
Web content mining is used to mine useful data, information, and knowledge from
web page content
Web structure mining, by contrast, helps to find useful knowledge or information patterns in the structure
of hyperlinks
Due to the heterogeneity and lack of structure in web data, automated discovery of new
knowledge patterns can be challenging
Web content mining scans and mines the text, images, and groups of web
pages according to the content of the input (query), and the results are displayed as a list in search engines
Web content consists of several types of data: text, image, audio, video, etc. Content data is the
collection of facts that a web page is designed to convey
It can provide effective and interesting patterns about user needs
Text documents are handled with text mining, machine learning, and natural language processing
For this reason, web content mining of text is also known as text mining
This type of mining scans and mines the text, images, and groups of web
pages according to the content of the input
For example: if a user wants to search for a particular book, the search engine provides a
list of suggestions
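Before the text of a web page can be mined, it has to be separated from its markup. The following is a minimal sketch of that step using Python's standard `html.parser` module; the class name `TextExtractor`, the helper `extract_text`, and the sample page are assumptions made for illustration, not part of the notes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content, for illustration only.
sample_page = """
<html><head><title>Book shop</title><style>p {color: red}</style></head>
<body><h1>Data Mining Books</h1><p>Introductory texts on web mining.</p>
<script>var x = 1;</script></body></html>
"""
page_text = extract_text(sample_page)
print(page_text)
```

The extracted string can then be passed to text mining steps such as tokenization or classification.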
Web Structure Mining:
Web structure mining is the process of discovering structure information from the web
The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting
related pages
Structure mining essentially produces a structured summary of a particular website. It identifies
relationships between web pages linked by information or by direct link connections
Example: Web structure mining can be very useful to companies to determine the connection between
two commercial websites
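The web graph described above can be sketched with plain dictionaries. This is a minimal illustration, assuming hyperlinks are available as (source page, target page) pairs; the function names and the `a.example` / `b.example` sites are hypothetical.

```python
def build_web_graph(links):
    """Build adjacency sets (page -> pages it links to) from hyperlink pairs."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # keep pages without out-links as nodes
    return graph


def in_degree(graph):
    """Count how many distinct pages link to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def links_between(links, site_a, site_b):
    """Hyperlinks that lead from a page on site_a to a page on site_b."""
    return [(s, t) for s, t in links
            if s.startswith(site_a) and t.startswith(site_b)]


# Hypothetical hyperlinks between two commercial sites.
links = [
    ("a.example/home", "a.example/products"),
    ("a.example/products", "b.example/shop"),
    ("a.example/home", "b.example/shop"),
    ("b.example/shop", "a.example/home"),
]
graph = build_web_graph(links)
degrees = in_degree(graph)
cross_links = links_between(links, "a.example", "b.example")
print(degrees)
print(cross_links)
```

Counting the links that cross from one site to the other is one simple way to quantify the connection between two websites.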

Web Usage Mining:
Web usage mining mines web log records (access information for web
pages) and helps to discover users' access patterns across web pages
A web server registers a web log entry for every web page access
Analysis of similarities in web log records can help e-commerce companies identify
potential customers
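A first step in web usage mining is turning raw log lines into records. The notes do not name a log format, so the sketch below assumes the widely used Common Log Format; the regular expression, the function `parse_log_line`, and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.0.2.1 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 5120',
    'malformed line',
]

entries = [e for e in map(parse_log_line, sample_log) if e is not None]
page_hits = Counter(e["path"] for e in entries)  # accesses per page
print(page_hits.most_common())
```

Lines that do not match are dropped here; a real preprocessing step would usually log or count them instead.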
Some of the techniques used to discover and
analyze web usage patterns are:
i) Session and visitor analysis:
Preprocessed data can be examined through session analysis, which covers
records of visitors, days, sessions, etc
This information can be used to analyze the behavior of visitors
A report is generated after this analysis, containing details of frequently visited web
pages and common entry and exit points
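Session analysis can be sketched as follows. This is a minimal illustration, assuming the preprocessed data is a list of (visitor, timestamp in seconds, page) records and that a gap of more than 30 minutes starts a new session; that timeout and all names are assumptions, not taken from the notes.

```python
from collections import Counter

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cut-off, not from the notes


def split_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (visitor, timestamp, page) records into per-visitor sessions."""
    sessions = []
    last_seen = {}  # visitor -> (time of last request, pages of open session)
    for visitor, timestamp, page in sorted(records, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(visitor)
        if previous is None or timestamp - previous[0] > timeout:
            current = []  # start a new session for this visitor
            sessions.append((visitor, current))
        else:
            current = previous[1]
        current.append(page)
        last_seen[visitor] = (timestamp, current)
    return sessions


def session_report(sessions):
    """Summarize frequently visited pages and common entry and exit pages."""
    return {
        "page_visits": Counter(p for _, pages in sessions for p in pages),
        "entry_pages": Counter(pages[0] for _, pages in sessions),
        "exit_pages": Counter(pages[-1] for _, pages in sessions),
    }


# Hypothetical preprocessed records, for illustration only.
records = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 4000, "/home"),
    ("u2", 10, "/products"), ("u2", 50, "/cart"),
]
sessions = split_sessions(records)
report = session_report(sessions)
print(report)
```

Visitor `u1` ends up with two sessions because the third request arrives more than 30 minutes after the second.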
ii) OLAP (Online Analytical Processing):
OLAP performs multidimensional analysis of complex data
OLAP can be performed on different parts of log-related data over a given interval of time
An OLAP tool can be used to derive important business intelligence metrics
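The idea of multidimensional analysis can be illustrated without an OLAP tool. The sketch below is only an approximation of a roll-up operation, assuming each log record has been reduced to a dictionary of dimensions; the dimension names (`day`, `hour`, `page`, `country`) and the function `roll_up` are hypothetical.

```python
from collections import Counter


def roll_up(records, dimensions):
    """Count records grouped by the chosen dimensions (a tuple of field names)."""
    cube = Counter()
    for record in records:
        key = tuple(record[d] for d in dimensions)
        cube[key] += 1
    return cube


# Hypothetical log-derived records, for illustration only.
log_records = [
    {"day": "Mon", "hour": 9, "page": "/home", "country": "IN"},
    {"day": "Mon", "hour": 9, "page": "/products", "country": "IN"},
    {"day": "Mon", "hour": 14, "page": "/home", "country": "US"},
    {"day": "Tue", "hour": 9, "page": "/home", "country": "IN"},
]

by_day_page = roll_up(log_records, ("day", "page"))  # finer view: two dimensions
by_day = roll_up(log_records, ("day",))              # rolled up to one dimension
print(by_day_page)
print(by_day)
```

Choosing a different tuple of dimensions gives a different view of the same data, which is the core idea behind slicing and rolling up in OLAP.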
Approaches In Web Usage Mining :
i) General Access Pattern Tracking:
The aim is to learn user navigation patterns
General access pattern tracking analyzes the web logs to understand access patterns and trends
These analyses can shed light on better structure and grouping of resource providers
ii) Customized usage tracking:
The aim is to learn about user profiling and user modelling in adaptive interfaces
Customized usage tracking analyzes individual trends
Its purpose is to customize websites to users
The mining techniques for web usage mining can be classified into two
commonly used approaches:
1. Map the usage data of the web server into relational tables before a data mining
technique is applied
2. Use the log data directly by utilizing special preprocessing techniques
Web usage data can also be represented with graphs
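The first approach, mapping usage data into a relational table, can be sketched with Python's built-in `sqlite3` module. The table name `access_log`, its columns, and the sample rows are assumptions chosen for illustration.

```python
import sqlite3


def load_usage_table(rows):
    """Load (visitor, page, status) rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE access_log (visitor TEXT, page TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical usage data, for illustration only.
rows = [
    ("u1", "/home", 200),
    ("u1", "/products", 200),
    ("u2", "/home", 200),
    ("u3", "/missing", 404),
]
conn = load_usage_table(rows)

# Once the data is relational, ordinary queries can summarize it.
top_pages = conn.execute(
    "SELECT page, COUNT(*) AS hits FROM access_log WHERE status = 200 "
    "GROUP BY page ORDER BY hits DESC, page"
).fetchall()
print(top_pages)
```

The same table could then feed standard data mining techniques such as association or sequence analysis.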

Text Mining:
Text mining is the process of extracting useful information and nontrivial patterns from a large
volume of text data
Various strategies and tools exist to mine text and find important data for
prediction and decision-making
Selecting the right text mining procedure helps to improve speed and
time complexity
“Text Mining is the procedure of synthesizing information, by analyzing relations, patterns, and
rules among textual data.”
Text mining is a part of data mining that extracts valuable information from a text database
repository
Text mining is a multi-disciplinary field based on information retrieval, data mining, AI, statistics,
machine learning, and computational linguistics
The Conventional Process of Text Mining Is as Follows:
Unstructured information is gathered from various sources, available in various document
formats, for example plain text, web pages, PDF records, etc
Pre-processing and data cleansing tasks are performed to detect and eliminate inconsistencies in
the data
The data cleansing process makes sure the genuine text is captured. It eliminates stop
words, applies stemming (the process of identifying the root of a word), and indexes the data
Processing and controlling tasks are applied to review and further clean the data set
Pattern analysis is implemented in a Management Information System (MIS)
The information processed in the above steps is used to extract important and
relevant data for effective and timely decision-making and
trend analysis
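The cleansing step of the process above (stop word removal and stemming) can be sketched in a few lines. This is a toy illustration: the stop word list is deliberately tiny, and `naive_stem` is a crude suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}
# Toy suffix rules, checked in this order; not a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The miners are mining patterns in the indexed documents.")
print(tokens)
```

Note how the toy rules reduce "mining" to "min" rather than "mine"; such over-stemming is one reason to prefer an established stemmer in practice.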
Procedures for Analyzing Text:
Text Summarization: To automatically extract partial content that reflects the whole content of a text
Text Categorization: To assign a text to one of the categories predefined by users
Text Clustering: To segment texts into several clusters, depending on their substantial relevance
Text Mining Techniques:
Information Extraction: The process of extracting meaningful words from documents
Information Retrieval: The process of extracting relevant and associated patterns according to
a given set of words or text documents
Natural Language Processing: Concerns the automatic processing and analysis of
unstructured text information
Clustering: An unsupervised learning process that groups texts according to their similar
characteristics
Text Summarization: To automatically extract partial content that reflects the whole content of a text
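Clustering by "similar characteristics" needs some measure of similarity between texts. The sketch below uses cosine similarity over word counts and a simple single-pass grouping; the threshold of 0.5, the grouping strategy, and the sample texts are assumptions for illustration, and the notes do not prescribe a particular clustering algorithm.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(texts, threshold=0.5):
    """Put each text into the first cluster whose first member is similar enough."""
    clusters = []
    for text in texts:
        for cluster in clusters:
            if cosine_similarity(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:  # no existing cluster matched, so start a new one
            clusters.append([text])
    return clusters


# Hypothetical short documents, for illustration only.
docs = [
    "web usage mining of server logs",
    "mining web server logs for usage",
    "stock market news today",
    "today stock market prices",
]
clusters = group_similar(docs)
for cluster in clusters:
    print(cluster)
```

With these inputs the two log-related texts fall into one cluster and the two stock-market texts into another, without any labels being supplied.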
Application Area of Text Mining:
1. Digital Library
2. Academic and Research Field
3. Life Science
4. Social-Media
5. Business Intelligence
Issues in Text Mining:
1. The efficiency and effectiveness of decision-making
2. Uncertainty can arise at an intermediate stage of text mining. In the pre-processing
stage, different rules and guidelines are defined to normalize the text, which
makes the text mining process efficient. Before applying pattern analysis to a document,
the unstructured data needs to be converted into an intermediate form
3. Sometimes the original message or meaning can be changed by this alteration
4. Another issue in text mining is that many algorithms and techniques support multi-language
text, which may create ambiguity in text meaning. This problem can lead to false-positive results
5. The use of synonyms, polysemy, and antonyms in document text creates problems for
text mining tools that treat them in the same context. It is difficult to categorize such kinds of
text or words
Unstructured Text:
Unstructured data is data that does not conform to a data model and has
no easily identifiable structure, so it cannot be used easily by a computer
program
Unstructured data is not organized in a pre-defined manner and does not have a
pre-defined data model, so it is not a good fit for a mainstream relational
database
Characteristics of Unstructured Data:
Data neither conforms to a data model nor has any structure
Data cannot be stored in the form of rows and columns as in databases
Data does not follow any semantics or rules
Data lacks any particular format or sequence
Data has no easily identifiable structure
Due to the lack of identifiable structure, it cannot be used by computer programs easily
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Advantages of Unstructured Data:
It supports data that lacks a proper format or sequence
The data is not constrained by a fixed schema
Very flexible due to the absence of a schema
Data is portable
It is very scalable
It can deal easily with the heterogeneity of sources
This type of data has a variety of business intelligence and analytics applications
Disadvantages Of Unstructured data:
It is difficult to store and manage unstructured data due to the lack of schema and
structure
Indexing the data is difficult and error-prone due to the unclear structure and the lack of
pre-defined attributes
As a result, search results are not very accurate
Ensuring the security of the data is a difficult task

Problems Faced in Storing Unstructured Data:
Storing unstructured data requires a lot of storage space
It is difficult to store videos, images, audio, etc
Due to the unclear structure, operations such as update, delete, and search are very
difficult
Storage cost is high compared to structured data
Indexing unstructured data is difficult

Features:
1. Word Occurrences:
The bag-of-words (vector) representation takes the single words found in the
training corpus as features, ignoring the sequence in which the words occur
This representation is based on statistics about single words in isolation
2. Stop Words:
Feature selection includes removing case, punctuation, infrequent
words, and stop words
3. Latent Semantic Indexing:
Latent Semantic Indexing (LSI) transforms the original document vectors to a lower
dimensional space by analysing the correlation structure of terms in the document
collection, such that similar documents that do not share terms are placed in the same
topic
4. Stemming:
Stemming reduces words to their morphological roots
5. n-Gram:
Other feature representations are also possible, such as using information about word positions in
the document or using an n-gram representation
6. Part of Speech (POS):
One important feature is the POS. There can be 25 possible values for POS tags. The most common tags are noun, verb,
adjective, and adverb. Thus we can assign the number 1, 2, 3, 4, or 5 depending on whether the word is a noun, verb,
adjective, adverb, or any other, respectively
7. Positional Collocations:
The values of this type of feature are the words that occur one or two positions to the right or left of the given
word
8. Higher-order Features:
Once the features are extracted, the text is represented as structured data, and traditional data mining techniques
can be used. These techniques include discovering frequent sets, frequent sequences, and episode rules
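Several of the features listed above (word occurrences, n-grams, and positional collocations) can be computed directly from a token list. The following is a minimal sketch; the function names and the sample sentence are assumptions for illustration, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter


def bag_of_words(tokens):
    """Word occurrence counts; the order of the words is ignored."""
    return Counter(tokens)


def ngrams(tokens, n):
    """All runs of n consecutive tokens, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collocations(tokens, index, window=2):
    """Words up to `window` positions to the left and right of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right


# Hypothetical sentence, for illustration only.
tokens = "web mining finds patterns in web data".split()

bag = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)
left, right = collocations(tokens, tokens.index("patterns"))
print(bag)
print(bigrams)
print(left, right)
```

Each of these outputs is structured data, so the frequent-set and frequent-sequence techniques mentioned above can be applied to them.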
