UNIT - 3 Final
UNIT - III
Syllabus :
UNIT III: (14 Hrs) WEB MINING: Introduction - Web content
mining - Web usage mining - Text mining - Unstructured text -
Episode rule discovery for text - Hierarchy of categories - Text
clustering.
What is Web Mining?
Web mining can broadly be seen as the application of adapted data mining techniques to the
web, whereas data mining is defined as the application of algorithms to discover patterns in
mostly structured data, embedded in a knowledge discovery process
A distinctive property of web mining is that it works with a variety of data types
The web has multiple aspects that yield different approaches to the mining process: web pages
consist of text, web pages are linked via hyperlinks, and user activity can be monitored via
web server logs
These three features lead to the differentiation between three areas: web content mining,
web structure mining, and web usage mining
Applications of Web Mining :
Web mining helps to improve the power of web search engines by classifying web
documents and identifying web pages
It is used for web searching, e.g., Google, Yahoo, etc., and vertical searching, e.g.,
FatLens, Become, etc.
Web mining is very useful for a particular website or e-service, e.g., landing page
optimization
Challenges in Web Mining :
The web poses great challenges for resource and knowledge discovery, based on the
following observations −
The web is too huge − The web is very large and growing rapidly, so it
seems too large for data warehousing and data mining
Complexity of web pages − Web pages do not have a unifying structure
There is a huge number of documents in the digital library of the web, and they are not
arranged in any particular sorted order
The web is a dynamic information source − The information on the web is rapidly updated. Data such
as news, stock markets, weather, sports, shopping, etc., are regularly updated
Diversity of user communities − The user community on the web is rapidly expanding. These users
have different backgrounds, interests, and usage purposes
There are more than 100 million workstations connected to the Internet, and the number is still rapidly
increasing
We either use a browser or search services when we want to find specific information on the web
We specify a simple keyword query, and the response from a web search engine is a list of pages ranked
by their similarity to the query
Low precision: This is due to the irrelevance of many of the search results. We may get many pages
that are not relevant to our query
Low recall: This is due to the inability to index all the information available on the web. Because some
pages are not properly indexed, we may not get those pages through any search engine
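The two measures above can be made concrete with a small calculation. This is a minimal sketch: the page identifiers and the function name `precision_recall` are hypothetical and only serve the illustration. Precision is the share of returned pages that are relevant, and recall is the share of relevant pages that were returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result list of one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # pages that are both returned and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers, for illustration only.
retrieved_pages = ["p1", "p2", "p3", "p4"]    # what the search engine returned
relevant_pages = ["p1", "p4", "p7", "p8", "p9"]  # what the user actually needed

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(precision, recall)
```

Here two of the four returned pages are relevant, and only two of the five relevant pages were found, so both measures are low.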
Discovering New knowledge from the Web:
We can term the above problem a query-triggered process
On the other hand, we can have a data-triggered process that presumes that we already have a
collection of web data and want to extract potentially useful knowledge from it
Personalized web page synthesis:
We may wish to synthesize a web page for an individual from the available set of web pages
Individuals may have their own preferences for the style of content and presentation while
interacting with the web
Information providers would like to create a system that responds to user queries by potentially
aggregating information from several sources, in a manner that depends on the user
Learning about Individual Users:
It is about knowing what the customers do and want
Web mining techniques provide direct solutions to these problems
In terms of data mining, web mining can be said to have three operations of interest
Clustering
Associations
Sequential Analysis
Types of Web Mining:
Web mining techniques can be broadly classified into three groups: web content mining,
web structure mining, and web usage mining
Web structure mining helps to find useful knowledge or information patterns from the structure
of hyperlinks
Due to the heterogeneity and absence of structure in web data, automated discovery of new
knowledge patterns can be challenging to some extent
Web content mining scans and mines the text, images, and groups of web
pages according to the content of the input (query), and the results are displayed as a list by search engines
Web content consists of several types of data: text, image, audio, video, etc. Content data is the
group of facts that a web page is designed to convey
Text documents are related to text mining, machine learning, and natural language processing
This type of mining scans and mines the text, images, and groups of web
pages according to the content of the input
For example: If a user wants to search for a particular book, the search engine provides a
list of suggestions
Web Structure Mining:
Web structure mining is the process of discovering structure information from the web
The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting
related pages
Structure mining basically shows the structured summary of a particular website. It identifies
Example: Web structure mining can be very useful to companies to determine the connection between
two commercial websites
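The web graph described above (pages as nodes, hyperlinks as edges) can be sketched with plain dictionaries. The page names and links below are hypothetical. Counting incoming links is only one simple structural summary, chosen here for illustration; it is not presented as the method these notes prescribe.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [
    ("home", "products"),
    ("home", "about"),
    ("products", "cart"),
    ("about", "products"),
    ("blog", "products"),
]

out_links = defaultdict(set)   # node -> set of pages it links to
in_degree = defaultdict(int)   # node -> number of distinct pages linking to it

for source, target in links:
    if target not in out_links[source]:  # ignore repeated links between a pair
        out_links[source].add(target)
        in_degree[target] += 1

# The page with the most incoming hyperlinks.
most_linked = max(in_degree, key=in_degree.get)
print(most_linked, in_degree[most_linked])
```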
Web Usage Mining:
The web server registers a web log entry for every web page request
Analysis of similarities in web log records can be useful to identify potential
customers for e-commerce companies
Some of the techniques to discover and analyze web usage patterns are:
i) Session and visitor analysis:
Session analysis is performed on the preprocessed data, which includes
records of visitors, days, sessions, etc.
This information can be used to analyze the behavior of visitors
A report is generated after this analysis, which contains the details of frequently visited web
pages and common entry and exit points
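A session analysis of this kind could look roughly as follows. This is a sketch under stated assumptions: the log records are hypothetical and already preprocessed into (visitor, time in seconds, page) tuples, and a new session is assumed to start after 30 minutes of inactivity, which is a common convention rather than something these notes specify.

```python
from collections import Counter

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

# Hypothetical preprocessed log: (visitor id, time in seconds, page).
log = [
    ("v1", 0, "/home"),
    ("v1", 60, "/products"),
    ("v1", 120, "/cart"),
    ("v2", 30, "/home"),
    ("v2", 90, "/about"),
    ("v1", 4000, "/home"),
    ("v1", 4100, "/products"),
]


def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Group page requests into per-visitor sessions, in time order."""
    sessions = []
    last_seen = {}  # visitor -> (time of last request, index of open session)
    for visitor, time, page in sorted(records, key=lambda r: r[1]):
        previous = last_seen.get(visitor)
        if previous is None or time - previous[0] > timeout:
            sessions.append([page])          # start a new session
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index].append(page)     # continue the open session
        last_seen[visitor] = (time, index)
    return sessions


sessions = build_sessions(log)
page_visits = Counter(page for session in sessions for page in session)
entry_pages = Counter(session[0] for session in sessions)
exit_pages = Counter(session[-1] for session in sessions)

print(page_visits.most_common(2))  # frequently visited pages
print(entry_pages, exit_pages)     # common entry and exit points
```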
ii) OLAP (Online Analytical Processing):
OLAP performs multidimensional analysis of complex data
OLAP can be performed on different parts of the log-related data over a certain interval of time
OLAP tools can be used to derive important business intelligence metrics
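The idea of multidimensional analysis can be illustrated by a roll-up over log data. The fact rows, the dimension names (`day`, `page`, `browser`), and the helper `roll_up` below are all hypothetical. A real OLAP tool offers far more than this, so treat the block as a sketch of the aggregation idea only.

```python
from collections import defaultdict

# Hypothetical fact rows derived from web logs: (day, page, browser, hits).
facts = [
    ("Mon", "/home", "Firefox", 5),
    ("Mon", "/cart", "Chrome", 2),
    ("Tue", "/home", "Chrome", 7),
    ("Tue", "/cart", "Chrome", 1),
    ("Tue", "/home", "Firefox", 3),
]
DIMENSIONS = ("day", "page", "browser")


def roll_up(rows, keep):
    """Sum hits, grouping only by the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last field is the measure (hits)
    return dict(totals)


hits_by_day = roll_up(facts, ["day"])               # coarse view
hits_by_day_page = roll_up(facts, ["day", "page"])  # finer view (drill-down)
print(hits_by_day)
print(hits_by_day_page)
```

Choosing a different set of dimensions in `keep` gives a different slice of the same data, which is the core of the multidimensional view.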
Approaches In Web Usage Mining :
i) General Access Pattern Tracking:
This is to learn user navigation patterns
General access pattern tracking analyses the web logs to understand access patterns and trends
These analyses can shed better light on the structure and grouping of resource providers
ii) Customized Usage Tracking:
This is used for user profiling and user modelling in adaptive interfaces
Customized usage tracking analyses individual trends
Its purpose is to customize websites for users
The mining techniques for web usage mining can be classified into two
commonly used approaches:
The first approach maps the usage data of the web server into relational tables before a data mining
technique is applied
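Mapping usage data into a relational table can be sketched with Python's built-in `sqlite3` module. The table name `access_log`, its columns, and the log entries are assumptions made for this example. Real server logs would first need to be parsed into such fields.

```python
import sqlite3

# Hypothetical, already-parsed log entries: (ip, time, url, status).
entries = [
    ("10.0.0.1", "2024-01-01 10:00:00", "/home", 200),
    ("10.0.0.1", "2024-01-01 10:01:00", "/cart", 200),
    ("10.0.0.2", "2024-01-01 10:02:00", "/home", 200),
    ("10.0.0.2", "2024-01-01 10:03:00", "/missing", 404),
]

connection = sqlite3.connect(":memory:")  # in-memory database for the sketch
connection.execute(
    "CREATE TABLE access_log (ip TEXT, requested_at TEXT, url TEXT, status INTEGER)"
)
connection.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", entries)

# Once the data is relational, ordinary queries can feed a mining step.
rows = connection.execute(
    "SELECT url, COUNT(*) AS hits FROM access_log "
    "WHERE status = 200 GROUP BY url ORDER BY hits DESC, url"
).fetchall()
connection.close()
print(rows)
```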
Pre-processing and data cleansing tasks are performed to detect and eliminate inconsistencies in
the data
The data cleansing process makes sure to capture the genuine text; it removes stop
words, applies stemming (the process of identifying the root of a certain word), and indexes the data
Processing and controlling tasks are applied to review and further clean the data set
Pattern analysis is implemented in a Management Information System
Text Summarization: To automatically extract the part of a text that reflects its whole content
Text Categorization: To assign a category to the text from among categories predefined by users
Text Clustering: To segment texts into several clusters, depending on their substantial relevance
Text Mining Techniques:
Clustering: It is an unsupervised learning process that groups texts according to their similar
characteristics
Text Summarization: To automatically extract the part of a text that reflects its whole content
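Grouping texts by similar characteristics, as in the clustering technique above, can be sketched with a very simple single-pass scheme. Everything here is an assumption for illustration: the Jaccard word-overlap similarity, the threshold of 0.2, and the greedy assignment are a stand-in for whatever clustering algorithm a course or tool actually uses.

```python
def jaccard(a, b):
    """Word-set overlap: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_texts(texts, threshold=0.2):
    """Greedy single-pass clustering; returns lists of text indices."""
    clusters = []  # each cluster: {"words": set of words, "members": indices}
    for index, text in enumerate(texts):
        words = set(text.lower().split())
        best, best_score = None, threshold
        for cluster in clusters:
            score = jaccard(words, cluster["words"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append({"words": set(words), "members": [index]})
        else:
            best["words"] |= words       # the cluster vocabulary grows
            best["members"].append(index)
    return [cluster["members"] for cluster in clusters]


# Hypothetical texts, for illustration only.
texts = [
    "web usage mining analyses server logs",
    "stock market prices fell today",
    "server logs record web usage",
    "market prices rose after the stock report",
]
groups = cluster_texts(texts)
print(groups)
```

No labels are supplied anywhere, which is what makes the process unsupervised: the groups emerge only from word overlap.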
Application Area of Text Mining:
1. Digital Library
2. Academic and Research Field
3. Life Science
4. Social Media
5. Business Intelligence
Issues in Text Mining:
1. The efficiency and effectiveness of decision-making
2. Uncertainty can arise at an intermediate stage of text mining. In the pre-processing
stage, different rules and guidelines are defined to normalize the text, which
makes the text mining process efficient. Prior to applying pattern analysis to the document,
the unstructured data needs to be converted into an intermediate structure
3. Sometimes the original message or meaning can be changed due to alteration
4. Another issue in text mining is that many algorithms and techniques support multi-language
text. This may create ambiguity in text meaning and can lead to false-positive results
5. The use of synonyms, polysemy, and antonyms in the document text creates problems for
text mining tools, which treat them all in a similar context. It is difficult to categorize such kinds of
text/words
Unstructured Text:
Unstructured data is data that does not conform to a data model and has
no easily identifiable structure, so it cannot be used by a computer
program easily
The data cannot be stored in the form of rows and columns as in databases
Due to the lack of identifiable structure, it cannot be used by computer programs easily
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Advantages of Unstructured Data:
It supports data that lacks a proper format or sequence
The data is not constrained by a fixed schema
Very flexible due to the absence of a schema
Data is portable
It is very scalable
It can deal easily with the heterogeneity of sources
This type of data has a variety of business intelligence and analytics applications
Disadvantages Of Unstructured data:
It is difficult to store and manage unstructured data due to the lack of schema and
structure
Indexing the data is difficult and error-prone due to the unclear structure and the lack of
pre-defined attributes
Due to the unclear structure, operations like update, delete, and search are very
difficult
1. Bag of Words:
The bag-of-words or vector representation takes the single words found in the
training corpus as features, ignoring the sequence in which the words occur
This representation is based on statistics about single words in isolation
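A bag-of-words vector can be built in a few lines. This is a minimal sketch with a hypothetical two-document corpus. Raw word counts are used as the per-word statistic, which is one common choice among several.

```python
from collections import Counter

# Hypothetical training corpus.
corpus = ["the web is huge", "the web is dynamic and the web grows"]

# Every distinct word in the corpus becomes one feature (one vector position).
vocabulary = sorted({word for document in corpus for word in document.split()})


def to_vector(document):
    """Count each vocabulary word in the document; word order is ignored."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


vectors = [to_vector(document) for document in corpus]
print(vocabulary)
print(vectors)

# Reordering the words does not change the vector.
print(to_vector("huge is web the") == vectors[0])
```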
2. Stop Words:
Feature selection includes removing case, punctuation, infrequent
words, and stop words
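The feature selection steps listed above can be chained as follows. The stop-word list, the minimum count of 2, and the sample documents are assumptions chosen to keep the sketch small. Real stop-word lists are much longer.

```python
import string
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "on"}  # tiny illustrative list
MIN_COUNT = 2  # words seen fewer times than this count as infrequent (assumed)

# Hypothetical documents.
documents = [
    "The Web is a huge source of data.",
    "Mining the web, and mining text!",
    "Text data on the Web is unstructured.",
]


def tokenize(text):
    """Lowercase the text and strip punctuation before splitting into words."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


tokenized = [tokenize(document) for document in documents]
counts = Counter(word for tokens in tokenized for word in tokens)

# Keep words that are frequent enough and are not stop words.
features = sorted(
    word for word, count in counts.items()
    if count >= MIN_COUNT and word not in STOP_WORDS
)
print(features)
```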
Features:
3. Latent Semantic Indexing:
Latent Semantic Indexing (LSI) transforms the original document vectors to a lower
dimensional space by analysing the correlation structure of terms in the document
collection, such that similar documents that do not share terms are placed in the same
topic
4. Stemming:
It reduces words to their morphological roots
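The effect of stemming can be shown with a deliberately crude suffix stripper. This is not the Porter algorithm or any other established stemmer: the suffix list and the minimum stem length of 3 are assumptions, and many English words would be handled incorrectly by it.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order (assumed list)


def crude_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [crude_stem(word) for word in ["links", "linked", "linking", "link"]]
print(stems)
```

All four word forms map to the same root, so they become a single feature instead of four.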
5. n – Gram:
Other feature representations are also possible, such as using information about the word positions in
the document or using an n-gram representation
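An n-gram representation keeps short runs of consecutive words, so some ordering information survives. The sketch below works on word n-grams of a hypothetical sentence. Character n-grams would follow the same pattern.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "web usage mining analyses web logs".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```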
6. Part of Speech (POS):
One important feature is the POS. There can be 25 possible values for POS tags. The most common tags are noun, verb,
adjective, and adverb. Thus we can assign the number 1, 2, 3, 4, or 5 depending on whether the word is a noun, verb,
adjective, adverb, or any other tag, respectively
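The numeric encoding described above can be written as a lookup with a default. The tagged words below are hypothetical and assumed to come from some POS tagger run beforehand. No tagging is performed in this sketch.

```python
# Codes follow the scheme in the text: noun, verb, adjective, adverb, other.
POS_CODES = {"noun": 1, "verb": 2, "adjective": 3, "adverb": 4}
OTHER = 5

# Hypothetical (word, tag) pairs, assumed to be produced by a POS tagger.
tagged = [
    ("users", "noun"),
    ("quickly", "adverb"),
    ("browse", "verb"),
    ("the", "determiner"),
    ("large", "adjective"),
    ("web", "noun"),
]

encoded = [(word, POS_CODES.get(tag, OTHER)) for word, tag in tagged]
print(encoded)
```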
7. Positional Collocations:
The values of this type of feature are the words that occur one or two positions to the right or left of the given
word
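Positional collocation features can be extracted by looking up to two positions on either side of the target word. The sentence, the function name `collocations`, and the use of signed offsets as feature keys are assumptions made for this illustration.

```python
def collocations(tokens, index, window=2):
    """Map each offset within the window to the word found at that position."""
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if offset != 0 and 0 <= position < len(tokens):
            features[offset] = tokens[position]
    return features


tokens = "the bank of the river was muddy".split()
features = collocations(tokens, tokens.index("bank"))
print(features)
```

Negative keys are positions to the left of the given word and positive keys are positions to the right. Offsets that fall outside the sentence are simply omitted.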
8. Higher order Features:
Once the features are extracted, the text is represented as structured data, and traditional data mining techniques
can be used. These techniques include discovering frequent sets, frequent sequences, and episode rules
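Discovering frequent sets over text that has been reduced to word features can be sketched as below. The documents and the minimum support of 2 are hypothetical, and the counting is a brute-force enumeration of word sets of a given size. It is not Apriori or any other optimized algorithm, and it covers neither frequent sequences nor episode rules.

```python
from collections import Counter
from itertools import combinations

MIN_SUPPORT = 2  # minimum number of documents containing the set (assumed)

# Hypothetical documents, already reduced to sets of word features.
documents = [
    {"web", "mining", "log"},
    {"web", "mining", "text"},
    {"text", "mining", "cluster"},
    {"web", "log"},
]


def frequent_sets(docs, size, min_support=MIN_SUPPORT):
    """Count every word set of the given size and keep the frequent ones."""
    counts = Counter()
    for words in docs:
        for itemset in combinations(sorted(words), size):
            counts[itemset] += 1
    return {itemset: count for itemset, count in counts.items()
            if count >= min_support}


frequent_words = frequent_sets(documents, 1)
frequent_pairs = frequent_sets(documents, 2)
print(frequent_words)
print(frequent_pairs)
```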