Web Mining
SEMINAR REPORT
WEB MINING
1 INTRODUCTION
2 WEB MINING
6 CONCLUSION
List of Figures
Chapter 1
INTRODUCTION
Low Precision: Users cannot browse all the returned pages one by one, and
most pages are irrelevant to what the user means; they are highlighted and
returned by the search engine only because they contain the keywords[4].
Web mining techniques can be used to solve the information overload
problems directly or indirectly. However, Web mining techniques are not the
only tools. Techniques and work from other research areas, such
as Database (DB), Information Retrieval (IR), Natural Language Processing
(NLP), and the Web document community, can also be used[2].
INFORMATION RETRIEVAL
Information retrieval is the art and science of searching for information in doc-
uments, searching for documents themselves, searching for metadata which
describes documents, or searching within databases, whether relational stan-
dalone databases or hypertext networked databases such as the Internet or
intranets, for text, sound, images or data[1].
Chapter 2
WEB MINING
electronic newsletters, electronic newswire, the text contents of HTML doc-
uments obtained by removing HTML tags, and also the manual selection of
Web resources.
Chapter 3
CHALLENGES OF WEB MINING
1. Today the World Wide Web is flooded with billions of static and dynamic
web pages created with technologies such as HTML, PHP and ASP. It is a
significant challenge to find useful and relevant information on the web.
7. The web is noisy, i.e. a page typically contains a mixture of many kinds
of information, such as main content, advertisements, copyright notices and
navigation panels.
9. The Web is not only about disseminating information but also about
services. Many Web sites and pages enable people to perform operations with
input parameters, i.e., they provide services.
Chapter 4
TAXONOMY OF WEB MINING
However, there are two other approaches to categorizing Web mining. In
both, the categories are reduced from three to two: Web content
mining and Web usage mining. In one, Web structure is treated as part of
Web content, while in the other Web usage is treated as part of Web
structure. All three categories focus on the process of discovering
implicit, previously unknown and potentially useful information from the
Web. Each of them focuses on different mining objects of the Web[2].
Figure 4.1: Taxonomy of Web mining
discover Web information based on these preferences, and preferences of other
users with similar interest[5].
Hyperlinks
A hyperlink is a structural unit that connects a location in a web page to
a different location, either within the same web page or on a different web
page. A hyperlink that connects to a different part of the same page is called
an intra-document hyperlink, and a hyperlink that connects two different
pages is called an inter-document hyperlink.
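The distinction can be illustrated with a minimal sketch that uses Python's
standard urllib.parse module. The function name classify_hyperlink and the
example URLs are hypothetical, and the rule "same URL after dropping the
fragment means same page" is an assumption made for illustration, not a
definition taken from the cited sources.

```python
from urllib.parse import urljoin, urldefrag


def classify_hyperlink(page_url: str, href: str) -> str:
    """Label a link as 'intra-document' or 'inter-document'.

    Assumption: two URLs denote the same page when they are equal after
    resolving the href against the page URL and dropping the fragment.
    """
    target = urljoin(page_url, href)      # resolve relative links
    target_page, _ = urldefrag(target)    # strip the "#fragment" part
    source_page, _ = urldefrag(page_url)
    if target_page == source_page:
        return "intra-document"
    return "inter-document"


# Hypothetical example URLs
print(classify_hyperlink("http://example.org/a.html", "#section2"))
print(classify_hyperlink("http://example.org/a.html", "b.html"))
```

A real crawler would likely also normalize URLs (case, trailing slashes,
redirects) before comparing them; that step is omitted here.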
Document Structure
In addition, the content within a Web page can also be organized in a
tree-structured format, based on the various HTML and XML tags within the
page. Mining efforts here have focused on automatically extracting document
object model (DOM) structures from documents[7].
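As a rough illustration of how a tag tree can be recovered from a page, the
sketch below uses Python's standard html.parser module. The class
TagTreeBuilder, the function tag_tree and the (tag, children) tuple layout
are hypothetical names chosen for this example, and the code assumes properly
nested markup; it is not the extraction method of the cited work.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay "open".
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTreeBuilder(HTMLParser):
    """Build a nested (tag, children) tree from properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = ("document", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)   # attach to the current parent
        if tag not in VOID_TAGS:
            self._stack.append(node)      # element stays open for children

    def handle_endtag(self, tag):
        # Assumption: markup is well nested, so only the innermost
        # open element can be closed.
        if len(self._stack) > 1 and self._stack[-1][0] == tag:
            self._stack.pop()


def tag_tree(html: str):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


print(tag_tree("<html><body><h1>T</h1><p>a<br>b</p></body></html>"))
```

Text nodes and attributes are ignored here to keep the tree small; a fuller
DOM would keep both.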
Web structure mining focuses on the hyperlink structure within the Web
itself. The different objects are linked in some way. Simply applying the
traditional processes and assuming that the events are independent can lead
to wrong conclusions. However, appropriate handling of the links can
reveal potential correlations and thus improve the predictive accuracy of
the learned models.
Two algorithms that have been proposed to deal with those potential
correlations are:
1. HITS and
2. PageRank.
4.2.1 PageRank
PageRank is a metric for ranking hypertext documents that estimates the
quality of these documents. The key idea is that a page has a high rank if it
is pointed to by many highly ranked pages. So the rank of a page depends
upon the ranks of the pages pointing to it. This process is repeated
iteratively until the ranks of all the pages are determined[4].
\[
PR(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in G} \frac{PR(q)}{\mathrm{Outdegree}(q)} \tag{4.1}
\]
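Equation 4.1 can be evaluated by repeated substitution. The sketch below is
one possible reading of it, not a reference implementation. It assumes that
n is the number of pages in the graph G, that d is the probability of jumping
to a random page, and that Outdegree(q) is the number of links on page q; the
text does not define these symbols. The value d = 0.15, the fixed iteration
count and the function name pagerank are also assumptions for the example.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Iterate Equation 4.1 on a link graph.

    graph: dict mapping each page to the list of pages it links to.
    d and iterations are assumed values, not taken from the text.
    """
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}           # uniform starting ranks
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}     # first term: d / n
        for q in pages:
            targets = graph.get(q, [])
            if not targets:
                continue  # simplification: pages without links pass on nothing
            share = (1 - d) * rank[q] / len(targets)
            for p in targets:                    # second term: sum over (q, p)
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-link graph
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(example).items()):
    print(page, round(score, 3))
```

In the example graph, page C is linked from both A and B and ends up with the
highest rank, which matches the idea that a page pointed to by many pages
ranks highly.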
4.2.2 HITS
Hyperlink-induced topic search (HITS) is an iterative algorithm for mining
the Web graph to identify topic hubs and authorities. Authorities are
pages that are good sources of content and are referred to by many other
pages or by highly ranked pages for a given topic; hubs are pages that are
good sources of links. The algorithm takes as input the search results
returned by traditional text indexing techniques and filters these results
to identify hubs and authorities[1]. The number and weight of hubs pointing
to a page determine the page’s authority. The algorithm assigns weight to a
hub based on the authoritativeness of the pages it points to. If many good
hubs point to a page p, then the authority weight of p increases. Similarly,
if a page p points to many good authorities, then the hub weight of p
increases[4].
After the computation, HITS outputs the pages with the largest hub weights
and the pages with the largest authority weights, which together form the
search result for a given topic.
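The mutual reinforcement between hubs and authorities can be sketched as
follows. This is a minimal illustration under stated assumptions: the link
graph is given directly as a dictionary (the step that builds it from search
results is omitted), weights are normalized to unit length after every round,
and the function name hits and the fixed iteration count are hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) weight dicts for a link graph.

    graph: dict mapping each page to the list of pages it links to.
    """
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub weights of the pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for q, targets in graph.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub weight of q: sum of authority weights of the pages q points to.
        hub = {q: sum(auth[p] for p in graph.get(q, [])) for q in pages}
        # Normalize both vectors so the weights stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical graph: three hub pages pointing at two content pages
example = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1"]}
hub, auth = hits(example)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this example a1 is pointed to by all three hubs and receives the largest
authority weight, while h1 points to both authorities and receives the
largest hub weight.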
Web usage mining focuses on techniques that could predict the behavior
of users while they are interacting with the WWW. It collects data from
Web log records to discover user access patterns of Web pages. Usage data
captures the identity or origin of web users along with their browsing
behavior at a web site.
In the use and mining of Web data, the most direct source of data is the
Web log files on the Web server. Web log files record visitors’ browsing
behavior very clearly. Web log files include the server log, agent log and
client log (IP address, URL, page reference, access time, cookies etc.)[3].
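To make the log fields concrete, the sketch below parses one server log line
into a record. The text does not say which log format is used, so the widely
used Apache combined log format is assumed here; the sample line, the pattern
LOG_PATTERN and the function parse_log_line are hypothetical and only meant
as an illustration.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format):
# ip ident user [time] "method url protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Made-up sample line using a documentation IP address
SAMPLE_LINE = (
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://example.org/start.html" "Mozilla/4.08"'
)
record = parse_log_line(SAMPLE_LINE)
print(record["ip"], record["url"], record["referrer"], record["time"])
```

Cookies are not part of this format by default; a site that logs them would
need an extended pattern.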
Web usage mining can be decomposed into the following three main
subtasks:
4.3.1 Pre-processing
Data preparation is necessary to convert the raw data for further
processing. The data actually collected are generally incomplete,
redundant and ambiguous[7]. In order to mine knowledge more effectively,
pre-processing the collected data is essential. Pre-processing
can provide accurate, concise data for data mining. Data pre-processing
includes data cleaning, user identification, user session identification,
access path supplement and transaction identification.
• The main task of data cleaning is to remove redundant Web log
data that is not associated with the useful data, narrowing the scope
of data objects.
• Identifying individual users must be done after data cleaning. The
purpose of user identification is to identify each unique user. It can
be accomplished by means of cookie technology, user registration
techniques and heuristic rules.
• User session identification should be done on the basis of user
identification. The purpose is to divide each user’s access information
into several separate sessions. The simplest way is to use a time-out
estimation approach: when the time interval between two page
requests exceeds a given value, the user is considered to have started a
new session.
• Because of the widespread use of page caching technology and
proxy servers, the access path recorded in the Web server access logs
may not be the complete access path of a user. An incomplete access log
does not accurately reflect the user’s access patterns, so it is necessary
to supplement the access path. Path supplement can be achieved by using
the Web site topology to analyze the pages.
• Transaction identification is based on user session identification,
and its purpose is to divide or combine transactions according to
the demands of the data mining task, so that they are suitable for
data mining analysis[3].
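The first three steps above (data cleaning, user identification and session
identification by time-out) can be sketched as follows. Several details are
assumptions made for illustration: records are taken to be dictionaries with
the keys ip, agent, url, status and time; a user is approximated by the pair
(ip, agent); image, stylesheet and script requests are treated as redundant;
and the time-out is set to 30 minutes. Path supplement and transaction
identification are not covered.

```python
from datetime import datetime, timedelta

# Assumed list of suffixes that mark requests irrelevant to the analysis.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def clean(records):
    """Data cleaning: keep successful requests for actual pages."""
    return [
        r for r in records
        if r["status"] == 200
        and not r["url"].lower().endswith(IGNORED_SUFFIXES)
    ]


def sessionize(records, timeout=timedelta(minutes=30)):
    """Split each user's requests into sessions using a time-out.

    Returns a dict mapping a user key (ip, agent) to a list of sessions,
    where each session is the list of URLs requested in time order.
    """
    sessions = {}
    last_seen = {}
    for r in sorted(records, key=lambda rec: rec["time"]):
        user = (r["ip"], r["agent"])          # heuristic user identification
        is_new = user not in last_seen or r["time"] - last_seen[user] > timeout
        if is_new:
            sessions.setdefault(user, []).append([])   # start a new session
        sessions[user][-1].append(r["url"])
        last_seen[user] = r["time"]
    return sessions


# Hypothetical records for one visitor
t0 = datetime(2024, 1, 1, 10, 0)
records = [
    {"ip": "192.0.2.1", "agent": "X", "url": "/a", "status": 200, "time": t0},
    {"ip": "192.0.2.1", "agent": "X", "url": "/logo.png", "status": 200,
     "time": t0 + timedelta(minutes=1)},
    {"ip": "192.0.2.1", "agent": "X", "url": "/b", "status": 200,
     "time": t0 + timedelta(minutes=5)},
    {"ip": "192.0.2.1", "agent": "X", "url": "/c", "status": 200,
     "time": t0 + timedelta(minutes=60)},
]
print(sessionize(clean(records)))
```

Identifying users by IP address and agent string is only a rough heuristic,
since proxy servers can hide several users behind one address; cookies or
registration data would be more reliable where available.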
done by using supervised inductive learning algorithms such as decision
tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers,
Support Vector Machines etc.
Chapter 5
APPLICATIONS OF WEB MINING
Web mining techniques can be applied to understand and analyze such data
and turn it into actionable information that can support a web-enabled
electronic business in improving its marketing, sales and customer support
operations. Based on the patterns found and the original cache and log data,
many applications can be developed. Some of them are:
Early on in the life of Amazon.com, its visionary CEO Jeff Bezos
observed, “In a traditional (brick-and-mortar) store, the main effort is in
getting a customer to the store. Once a customer is in the store they are
likely to make a purchase - since the cost of going to another store is high -
and thus the marketing budget (focused on getting the customer to the store)
is in general much higher than the in-store customer experience budget (which
keeps the customer in the store). In the case of an on-line store, getting in
or out requires exactly one click, and thus the main focus must be on customer
experience in the store.” This fundamental observation has been the driving
force behind Amazon’s comprehensive approach to personalized customer
experience, based on the mantra “a personalized store for every customer”.
A host of Web mining techniques, e.g. associations between pages visited,
click-path analysis, etc., are used to improve the customer’s experience
during a store visit. Knowledge gained from Web mining is the key intelligence
behind Amazon’s features such as instant recommendations, purchase circles,
wish-lists, etc[3].
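As a generic illustration of "associations between pages visited", the sketch
below counts how often two pages occur in the same session. It is not a
description of Amazon's actual system; the function name page_pair_support,
the session data and the use of simple pair support are all assumptions made
for the example.

```python
from collections import Counter
from itertools import combinations


def page_pair_support(sessions):
    """Return the fraction of sessions in which each page pair co-occurs.

    sessions: list of sessions, each a list of visited page URLs.
    """
    counts = Counter()
    for session in sessions:
        # Use each page once per session; sort so (a, b) and (b, a) coincide.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical sessions
sessions = [
    ["/home", "/books", "/cart"],
    ["/home", "/books"],
    ["/home", "/music"],
]
support = page_pair_support(sessions)
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
```

Pairs with high support are candidates for recommendations of the form
"visitors who viewed this page also viewed that one"; a full association rule
miner would also compute confidence and handle larger page sets.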
The predictive capability of a mining application can also benefit
society by identifying criminal activities[1].
Chapter 6
CONCLUSION
As the Web and its usage continue to grow, so does the opportunity to analyze
Web data and extract all manner of useful knowledge from it. The past few
years have seen the emergence of Web mining as a rapidly growing area,
due to the efforts of the research community as well as various organizations
that are practicing it. The key component of web mining is the mining process
itself. Here we have described the key computer science contributions made
in this field, including an overview of web mining, the taxonomy of web
mining and the prominent successful applications, and outlined some promising
areas of future research.
Bibliography
[2] http://www.galeas.de/webimining.html
[6] Brijendra Singh, Hemant Kumar Singh, Web Data Mining Research: A
Survey, IEEE, 2010
[7] Soumen Chakrabarti, Mining the Web: Discovering Knowledge from
Hypertext Data, Part 2, 2003 edition