Ijcsi 13 6 68 75

IJCSI International Journal of Computer Science Issues, Volume 13, Issue 6, November 2016
ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784

www.IJCSI.org https://doi.org/10.20943/01201606.6875 68
Graph Theoretic and Genetic Algorithm-Based Model

for Web Content Mining
Moses Akinjide Adelola1, Olumide Sunday Adewale2, Gabriel Babatunde Iwasokun2
1ICTA Centre, Adekunle Ajasin University, Akungba,
Akungba, Ondo State, Nigeria
2Department of Computer Science, Federal University of Technology,

Akure, Ondo State, Nigeria
Abstract taken to mine useful information. These challenges

The World Wide Web (www) is arguably the largest and prompted the need for improved way to accessing web
the most heterogeneous repository of data and has documents and shift in paradigm from demand of
continued to expand in size and complexity. With information to demand for knowledge. This also explains
consistency in expansion, retrieval of required web pages why the World Wide Web is gradually transforming into
and information has become a herculean task for web users semantic web, which is knowledge oriented and when
due to information overload and worst still, existing web subjected to mining, leads to discovery of end users-
content retrieval techniques have not exhibited enough oriented knowledge.
efficiency in areas of speed and accuracy. This paper The web mining field encompasses a wide array of issues
presents a Graph Theoretic (GT) and Genetic Algorithm that centered on deriving actionable knowledge from the
(GA)-based technique for mining of web documents. The web as well as information retrieval (IR), database
technique utilizes graph representations of document technologies and artificial intelligence-based researches [2].
content to address the problems of initialization, Information retrieval (IR) is the process of obtaining
convergence to local minimal and failure to handle large relevant information from a collection of resources. It may
datasets. The technique works in three phases; namely rely on searches based on metadata or full-text (or other
contents extraction, preprocessing and database formulation content-based) indexing and can be seen as a standard
while Maximum Common Sub-graph (MCS) was used to optimization problem [3]. IR has search space S
calculate the distance between clusters. Results of the web- represented by the set of documents, a set of possible
based experimental study on Pentium 4 with 2GHz solutions S+ (that is, the possible documents related to the
processor and 1GB RAM running on Window 7 operating user query), and evaluation function, f for the investigation
system platform with web scraper (import.io) as front-end of the relevance of each of the possible documents related
and PHP 6 and MySQL5 as back-ends show the to the user query. Finally, a search engine tries to output
applicability and the superiority of the new techniques over documents that maximize f. The optimal solution is a
some existing ones. document or set of documents that have the maximum
Keywords: Web mining, graph theory, genetic algorithm, score returned by function f. Such an optimization problem
knowledge discovery can be solved efficiently using genetic algorithm, which
offers a probabilistic model that borrows from the process
1. Introduction of natural selection in living organisms. Moreover, genetic
search algorithms enable intelligent and efficient Internet
Internet is a global computer network that has played major searches and are especially useful when the search space is
roles in present day information dissemination and relatively large, as is the case with the web [4].
knowledge discovery. It provides varieties of information
and communication facilities with interconnected networks Web content
that are based on standardized communication protocol. Today, there are information on business transactions,
With its large and dynamic information source that is scientific data, medical and personal data, satellite sensing
structurally complex and ever growing, the Internet offers a and pictures, text reports, surveillance video and pictures,
fertile ground for data (web) mining principles. Web games, digital media, Computer Aided Design (CAD),
mining generally involves the discovery of user access software engineering data, Virtual Worlds (VW), text
patterns and useful knowledge from the structure of reports and memos (e-mail messages), www repositories
hyperlinks on web usage logs and mining, extraction and and military intelligence on the web [5-7]. Based on these,
integration of useful data, information and knowledge from web contents (data) appear in form of text, images, sounds
web page contents [1]. Due to its heterogeneity and lack of (audio), videos, animations, applications, e-services,
structure, the challenges confronting web data mining personal web pages, archived e-mail messages and more
include inaccuracy of the “mined document” and the time [8]. Generally, the more the contents of a website are
2016 International Journal of Computer Science Issues

dynamic, useful and interesting, the more is their patronage performance of method with particular types of data, the
and success. hardware and software facilities available and the size of
the dataset.
Data Mining
Data mining is the nontrivial extraction of implicit, Web Mining
previously unknown and potentially useful information Web mining is the application of data mining techniques on
from data in databases and its basic steps in an iterative the web data to solve the problem associated with
extracting useful information [13, 17-18]. Most of the
knowledge discovery process is presented in Figure 1 [9].
existing search engines lack the efficiency of providing
Data mining is applicable to any kind of information relevant and required information as the information on the
repository with different data, algorithms and approaches Internet increases [19-21]. The huge, diverse and dynamic
and is being put into use and studied for relational, object- nature of the Web has resulted in information overload and
relational, object-oriented and transactional databases as heightened the need for an intelligent software agent for
well as unstructured and semi-structured repositories such finding, sorting and filtering the available information [6,
as the www and spatial, multimedia, time-series and textual 22-23]. Web mining can be decomposed into resource
finding, information selection and preprocessing,
databases [10-11].
generalization and analysis [18]. Resource finding means
Knowledge the process of retrieving the data that is either online or
offline from the web sources like text, relational data and
Pattern semi structural data like XML. The information selection
Evaluation
and preprocessing step is any kind of transformation
processes of the original data retrieved in the IR process.
Machine learning or data mining techniques are used for
Data Mining generalization [17]. Based on which part of the Web to
. mine, web mining is divided into three areas of interest;
Task‐Relevant namely Web Content Mining (WCM), Web Structure
Data Mining (WSM) and Web Usage Mining (WUM) as shown
in Figure 2.
Selection and
Data Transformation
Warehouse Web Mining
Data Cleaning
Web Content Web Structure Web Usage

Data Integration
Mining Mining Mining
Databases
Figure 1: Data mining as the core of knowledge discovery process
Text & Hyperlink Web log records
Multimedia Structure
The iterative process of data mining consists of data documents
cleaning (or cleansing), data integration, data selection, data
transformation, data mining, pattern evaluation and Fig. 2: Web Mining Divisions
knowledge representation [12]. Data mining functionalities
include data characterization, association analysis, Web Data Extraction Technique (WDET)
classification, prediction, clustering, outlier analysis and A WDET provides a broad class of software applications
evolution and deviation analysis [13-14]. Existing data targeting at extracting information from the web. If the
mining tools include some traditional data mining programs
source is an HTML Web page, the extracted information
that are available in both Windows and UNIX operating
systems, dashboards and text-mining tools [15]. Data could consist of elements in the page as well as the full-text
mining techniques include artificial neural networks, of the page itself [24]. The extracted data might be post-
decision trees, the nearest-neighbor method, clustering, processed, converted in the most convenient structured
classification, association, sequential patterns and format and stored for further usage [25-29]. WDETs find
predictions among others [16]. extensive use in a wide range of applications including the
analysis of text based documents like e-mails, support
Data Clustering (DC) forums, technical and legal documentation and so on and its
With DC, logically similar information is physically stored
importance depends on the fact that a large (and steadily
together for minimal disk access and increased database
efficiency. The available DC methods are Probabilistic growing) amount of information is continuously produced,
Clustering (PC), Hierarchical Clustering (HC), Text-based shared and consumed online. A company can probe the
Clustering (TC), Graph-based Clustering (GC), Neural Web to acquire and analyze information about the activity
Network-based Clustering (NNC), Fuzzy Clustering (FC) of its competitors, identify the opportunities provided by
and Partitioning [5, 16-17]. The choice of a particular the market and learn from faults and successes [30].
method depends on the type of output desired, the known

Graph Theory The need to track and analyze the usage patterns of web
Graphs are mathematical constructs for representing objects users in the deep web motivated the research in [35]. The
or systems which contain structural (or relationship) research identified enormous effort, time wasting and
information and have been used in many domains, from avoidable extra monetary cost as part of the challenges
software engineering to artificial intelligence. Graphs are facing the Internet users in their bid to extract useful
more robust than typical vector representations as they can knowledge from the web. Extensible Markup Language
model structural information that is usually lost when (XML) format of web pages and XML document object
converting the original web document content to a vector model were used for information extraction. Correlation
representation. With graph, information such as the mining approach was used for finding the correlation
location, order of proximity or term of occurrence which is attributes in query interfaces while Jaccard measurement
discarded under the standard document vector technique was used to measure the degree of similarity
representation models is captured. A graph G is defined as between items. Though the proposed algorithm ably found
G=(V,E) where V is a set of nodes (also called vertices), correlated attributes in query interfaces with greater
and E the set of edges connecting the nodes. It is also accuracy and speed compared to some existing algorithms,
defined as a 4-tuple: G= (V, E, α, β), where V is a set of it however miss out with its trial by error method, due to
nodes (vertices), E V×V is a set of edges connecting the absence of standard or general way of finding the most
suitable measurement. Also, as a convectional vector
nodes, α: V→ ∑v is a function labeling the nodes, and β:
representation technique, vital information such as
V×V ∑e is a function labeling the edges. ∑v and ∑e are proximity of word occurrence and or the location of a word
the sets of labels that can appear on the nodes and edges, within a document were not captured thereby leading to
respectively. For brevity, G may be referred to as G= (V, E) loss of information.
by omitting the labeling functions [31-33]. The research work presented in [14] revealed how
increased redundancy and duplication in web pages result
2. Literature Review in indexing space and retrieval and removal time
Every Internet user desires satisfactory results from web complexity. This prompted the authors to develop a web
search engine in the sense that all the retrieved results are outlier mining system that offers speedy and accurate
relevant and all relevant documents are retrieved. This retrieval of information from structured and unstructured
implies that the web user is mostly satisfied when the web documents. Obtained results presented some vital
information retrieval system retrieves all and only the structural information such as the order and proximity of
relevant documents within a reasonable response time. terms occurrence and the location of word within a
Despite significant improvement on the existing search document as not captured by the system. The research in
techniques, web users still encounter some problems [36] was motivated by the need to facilitate knowledge-
relating to the retrieval of irrelevant documents from the based response to the user and also to discover hidden
web. Most of the early researches on web content mining patterns from the web. An agent based system was
focused on the retrieval of textual contents with the developed as a solution for mining semantic web contents
traditional text representation method and vector space and to provide context based knowledge oriented results to
model presenting several noticeable weak points of the user. The shortfall of the work is that in the process of
inability to capture text structure and the semantic discovering hidden patterns from the web, some
information of text content. unrequested information are retrieved, which may lead to
The authors in [34] developed a web content mining system information overload and time wastage. A composite graph
using a graph-based representation technique and the k- model and maximum common sub-graph-based technique
means-based clustering algorithm to overcome the problem for web document extraction is proposed in [37]. The
of vector space model. Although, the work presented some research focused on the development of a standard method
good results, its solution often converges to local minima. for representing web documents as graphs and graph-based
In [33], graph-theoretic method formed the platform for the classification of web documents. Tag and context sensitive
analysis of protein structural information on the basis that graph models on extracted web pages samples were used
sub-graph and maximum common sub-graph isomorphism alongside graph distance computations for similarity
algorithms from graph theory provide effective and comparison and measurements.
efficient way of identifying structural relationships between In [38], an optimal graph theoretic approach to data
objects. Although the platform justified extension of graph clustering was presented. The research used network flow
theory to other areas of machine learning base on its theory as data clustering technique and graph theoretic
sufficient identification of the sequence relationships approach to image segmentation with a view to handle
between biological macromolecules, it is however only maximum flows computation in an undirected graph. The
restricted to graph theory-based analysis for which three- proposed system suffers in its inability to handle moving
dimensional crystallographic or nuclear magnetic resonance images. A generalized graph-theoretic mesh optimization
(NMR) structures are available. The need for a system with model is proposed in [39]. Direct derivation of mesh
improved and optimal representation of web contents optimization model, primitives and multi-pole components
motivated the work of the authors in [31]. A hybrid web method and algebraic multi-grid principles based on
document representation methodology based on vector coarsening technique were used. With this model, heuristics
space and graph models was presented for classifying web assumptions of analogy of FEM stiffness and cutest
document contents. The calculation of the classification rate matrices made in the graph theoretic method which
for each candidate sub-graph is however too complicated required rigorous theoretical validation were used. The
for the methodology and the ensued solution is often not stability of the coarse/fine interpolation between meshes in
optimal.

Garlekin operator and the limits of the AMG-type nodes. The parameter for Equation 1 is derived from
coarsening can as well not be predicted. Equations (2) through (6)
| , |
3. Proposed GT AND GA-Based Technique , 1 | |,| |
(2)
The proposed system combines the strength of existing

The graph union-based distance measurement algorithm
works proposed in [34, 38-39]. Its conceptualization is
proposed in [40] is adopted as follows:
presented in Figure 3 with three phases; namely contents
extraction, preprocessing and database of mined data. | , |
Based on the user query, the content extraction phase uses , 1 | | | | |
(3)
, |
the search engine to extract and store raw web pages from
the Internet. The second phase uses graphic theoretic and With graph union, the denominator represents the size of
genetic algorithms to process the extracted web document the union of the two graphs in the set theoretic sense;
base on tag separation, stop word removal and stemming. specifically, adding the size of each graph (|G1| + |G2|) then
subtracting the size of their intersection (|mcs(G1,G2)|) leads
Graph Theoretic and Genetic Algorithms to the size of the union. This distance measure behaves
In order to design the graph theoretic model for web similarly to mcs. The use of graph union in the denominator
content mining, the graph similarity and the frequent sub is premised on the need for allowing for changes in the
graph extraction algorithms were used. The input of the smaller graph to exert some influence over the distance
algorithm is a set of training examples; a query instance q, measure, which does not happen with mcs. This
and a parameter k, defining the number of neighbors to use. measurement is a metric created in the interval [0,1]. A
The output is labeled to indicate the class of the query similar distance measure not normalized to the interval [0,
instance q. First, determine the closest training examples to 1] is presented as follows [41]:
q according to the distance. The second step is to select the
nodes that are closest to the center. In order to get the , | | | | 2| , (4)
distance between the clusters, the maximum common sub-
graph distance measure was used. The data for the nearest The distance based on the maximum common sub-graph
neighbours was determined using the method of graph and the minimum common super-graph is measured based
techniques to model web documents. on the formula [42]:
, | | | | | , (5)
Extracted Web Search Engine
Document
This distance measure is informed by the concept that the
maximum common sub-graph provides a "lower bound" on
Pre-Processing
the similarity of two graphs, while the minimum common
super-graph is an "upper bound". If two graphs are
Tag Separation User Query
identical, then both their maximum common sub-graph and
minimum common super-graph are the same as the original
Stop Word Removal graphs and |G1| = |G2| = |mcs(G1,G2) | = |mcs(G1,G2)|, which
Web Documents
leads to dmmcs(G1,G2)= 0. As the graphs become more
Stemming
dissimilar, the size of the maximum common sub-graph
decreases, while the size of the minimum common super-
graph increases. This in turn leads to increasing values of
Redundancy dmmcs(G1,G2). For two graphs with no maximum common
checking
Mined sub-graph, the distance becomes |mcs(G1,G2) | = (|G1| +
Documents |G2|). mmcs is a metric that does not produce values
normalized to the interval [0,1], unlike the mcs or wgu. If it
Redundancy holds that |mcs(G1,G2)| = |G1| + |G2| - |mcs(G1,G2) | G1,G2,
Removal then dMMCS(G1,G2)= |G1| + |G2| - 2|mcs(G1,G2)| is
Domain computed. This is much less computationally intensive than
Dictionary computing the minimum common super-graph. A version
of this distance measure normalized to [0, 1] is created as
Figure 3: Proposed System Architecture follows:
| , |
, 1 | |,|
(6)
|
The distance is measured using Equation 1.
|∑ , | With this algorithm, several parameters are defined to
, ∑ ∑
(1) control the properties of resulting cluster hierarchy. The
| | , | |
Maximum Terms Threshold (MTT) is the first parameter
mcs is the minimum common sub-graph, G1 is graph 1 and and it restricts the maximum number of vertices in the
G2 is graph 2. The measured distance is based on mcs and resulting graph representations of documents. There are two
to make use of the information held by the composite options with the first option as the default choice and
model, all measurements may be modified to incorporate involves the use of thirty of the most frequent terms in the
the vertex degree instead of simply taking the number of pages while the second option is to use the maximum
threshold terms that are most occurring. The second whole data set. For the k-cluster, the centers are derived by
parameter is the Minimum Pages Threshold (MPT), which obtaining the centers of that cluster and then determine the
is used to eliminate the clusters having fewer pages than the location of the new centers of the cluster which are the
maximum point assigned to them using a default value of 3. most optimum. To achieve this, the data items are
The other parameter is the Maximum Distance Threshold considered each at a time. The algorithm is then applied on
(MDT), which is used to restrict the expansion of the these data items, and the results examined. The new centers
hierarchy. Clusters whose difference in size from their are then determined and used. The determination of the new
parents is greater than MDT are not added to the hierarchy centers is based on the formula:
[43] and the default distance is two, which has the
capability to accept one new word to an existing phrase that … ∑∑ ∈ || || (8)
is one node and one side.
The Maximum Cluster Threshold (MCT) is another N is the number of data items, M is the number of clusters,
parameter and it limits the total size of the hierarchy. The xi is the data item i and mk is the cluster center k. I(x) = 1
hierarchy construction phase of the algorithm is stopped when x is true and 0 otherwise. For this reason, a faster way
once it has created MCT clusters or there are no candidate to calculate the centers is stated as follows:
graphs remaining. The default value is 50. Finally, there is
the Base Cluster Size Threshold (BCST) which limits the ∑ max || || , 0 (9)
size of the base clusters. No new base cluster is created if
its size is more than the BCST. The default BCST is three is the distance between data item xj and the center of
and is usually large enough to allow a phrase with two the cluster that is closest when using this clustering
terms, which are nodes connected by an edge as a base algorithm. A new cluster is then selected to be the input for
cluster. the algorithm. Before the genetic algorithm is applied on a
This algorithm has three basic steps; namely original web page, there has to be some pre-processing to be done.
hierarchy construction, document assignment and bottom- This involves searching all the pages that contain the terms
up clustering pruning. The first step is the initial hierarchy in the query of the user. A weighted web tool is applied and
construction. The candidate chart having the minimum size the genetic algorithm is designed and iterated to come up
for a graph G, │G│ is the sum of the quantity of edges, E with the optimal solution.
and the vertices, V and are shown by │V│+│E│ with this When a web user enters a list of key words, then a new
cluster taken as the cluster candidate. If there is a tie, one of document is created containing the documents with the key
the graphs is selected at random. The likely parents of a words and its frequency in the search. Through the
candidate in the hierarchy are then identified, so that any chromosome of a gene, each document is allocated a
parent that is defined here is the lowest. If the candidate of reference number. Chromosome is made up of sets of
the cluster has no parents cluster and if the size of the graph genes, gi such that gi [d1…..dn]. The number of genes in the
is not greater than or equal to the BCST, then the cluster
chromosome does not, at any one point, exceed the
should be added. If it is otherwise, then the cluster
maximum number of documents that are selected at
candidate is added to the hierarchy. The cluster candidate is
random. The chromosomes that are produced have varying
then removed from the set of candidate graph. After this, if
lengths. The solution generated is mostly between five and
the clusters are less than the MCT and a candidate graph is
a specified maximum number. The first chromosome, g1, is
still remaining; proceed to the initial original assignment
phase. As the nodes are added, the cluster in the hierarchy then generated and S=S g1. The loop is then iterated until
that has the smallest distance is determined according to the the chromosomes generated are equal to the length
mcs distance measure as follows: requested [44]. The first generation of chromosomes is
randomly generated and has a fixed number. In the
| , |
, (7) subsequent generations, two individuals are selected
| |,| |
randomly, and the one with a higher fitness value is
selected. In this algorithm, the crossover is used to bring
If there is 1 minimum distance, then this page has to be forth offspring from the existing population. It is mostly
skipped. The pages are assigned to the clusters that have a operated with a probability of 0.8. Two randomly selected
minimum distance as a native page. The inherited pages parents of different lengths were selected, and a point of the
continue to give up the cluster from the child to the parent crossover is also selected randomly. The first point to be
until the base cluster is reached. In the next phase, starting selected is based on the length of the parent and the other
with the lowest level in the hierarchy, all clusters at that point is also selected in accordance to the length of the
level are eliminated from the level that has fewer pages parent. The first offspring is obtained by switching the tail
than the MPT pages assigned to them. Given the new of the second parent from the first selected point [45]. The
hierarchy, all the pages are re-aligned from the deleted second offspring is obtained from the first parent from the
clusters. The orphaned clusters are fixed by updating the second point. Any duplicate gene is removed.
parent information. This is repeated until the top level is
reached. For each cluster, the longest simple paths are
displayed first. Isolated nodes are displayed as single terms. 4. System Implementation
If a cluster happens not to be a base one, then those on the
hierarchy are shown. The implementation environment is a window 7 operating
The basic step is an additional determination of the cluster system on a 2.4GHz Core i3 processor with 4GB RAM.
centers. This is a good way to determine the initial centers Import.io (web scraper) served as the frontend while PHP 6
for the clusters in the algorithm. Starting at the case of one and MySQL5 served as the backend. Web Scraping (also
cluster, the center of a cluster is set as the centroid of the termed Screen Scraping, Web Data Extraction or Web

Harvesting) is very useful in the extraction of large l is the left side pages, r is the right side pages while d is the
amounts of data from websites tabular or spreadsheet total dataset in the database.
format. Data displayed by most websites can only be The next factor, f2 is described by the ratio of the frequency
viewed using a web browser. Examples are data listings at of the left and the right side pages and the pages on the left
yellow pages directories, real estate sites, social networks, of the same user and is obtained from:
industrial inventory, online shopping sites, contact
databases and so on. Since most websites do not offer (11)
functionality for saving data and displaying on computer,
Web Scraping is therefore used to automate the tedious and Finally, the third factor, f3 is the ratio between the pages on
slow process of manually copying and pasting website the left side of a particular user and a page from the left
(browser) data to a local file. side together with the first page, p on the right and is
obtained from the equation:
Experimental Setup
A total of 2100 web pages were downloaded with various (12)
web data types. The downloaded pages were divided into
training and testing pages using the structured query The time duration, T for the computation of f1, f2 and f3 is
language (SQL) statements shown in Figure 4. The web obtained from:
data extraction is based on the use of a web scraping
software import.io on selected websites and the extracted T ∑ N (13)
web page is converted to comma separated values (csv)
format which serves as input to the next phase. At the data
preprocessing phase, the data obtained from different lu and Nu are the left side pages and the time for user u
sources such as HTML documents, browser logs are respectively. If obtained duration exceeds the threshold,
cleaned before processing and grouping in accordance to then the cross over process takes place and the total quality,
physical location. The input data to the preprocessing Qt is calculated based on the formula:
algorithm consists of the web pages accessed in a session
by a web user. The data is in a tree-like format and listed in ∑
Q (14)
order of access.
SQL statement for Training dataset represents the node in a tree i and l is the total number
SELECT v1.* FROM ( -- randomly divide members of the population into of nodes in a tree. The crossover stage is a process of
subgroups based on target classes
SELECT a.*, row_number() OVER (partition by {target column} ORDER BY interchanging the sequence of nodes in a particular tree.
ORA_HASH({case id column})) "_partition_caseid" FROM {input data} a This process deals with those trees that have the right
) v1, ( -- get the count of subgroups based on target classes fitness values above the threshold and the selected nodes
SELECT {target column}, move on to the mutation process. Mutation is an iterative
COUNT(*) "_partition_target_cnt"
FROM {input data} GROUP BY {target column} ) v2 process for node transformation and selection of
WHERE v1. {target column} = v2. {target column} chromosomes with the best fitness levels.
-- random sample subgroups based on target classes in respect to the
sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <=
Results
(v2."_partition_target_cnt" * {percent of training dataset} / 100) The extracted data is tested using three standard measures;
namely precision (p), applicability (a) and hit ratio (h),
SQL statement for Test dataset which are measured based on Equations 15, 16 and 17
SELECT v1.* FROM ( -- randomly divide members of the population into
respectively:
subgroups based on target classes
SELECT a.*, row_number() OVER (partition by {target column} ORDER BY
ORA_HASH({case id column})) "_partition_caseid" FROM {input data} a xy (15)
) v1, ( -- get the count of subgroups based on target classes yz (16)
SELECT {target column}, (17)
COUNT(*) "_partition_target_cnt"
FROM {input data} GROUP BY {target column}
) v2 The initial population for the real and generated data is 800
WHERE v1. {target column} = v2. {target column} and the obtained results for GA and training samples are
-- random sample subgroups based on target classes in respect to the presented in Table 1 with superior performance for the
sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > genetic algorithm.
(v2."_partition_target_cnt" * {percent of training dataset} / 100)
Table 1: Comparison of GA-generated data and
Figure 4: structured query language (SQL) training samples
Measure T1 (GA) T2 (Training data)
The fitness level of the data is then calculated from each Precision 94.4225 89.025
user based on the sum of three factors. The first factor, f1, is Applicability 100.0000 87.500
the ratio of the frequency of the data of a given web user Hit ratio 94.4225 89.000
when compared to all the other users in the database and is
obtained from the formula: For performance evaluation, three experiments were
conducted with notations F, J and K series for all the
(10) available web documents that can be represented with
graphs as well as truth value. The F-series originally
contained 93 web pages, each of which were subdivided

into four major groups; namely manufacturing, finance, representation of experimental values for Graphs, Random,
business and education. The J-Series contained 185 pages Euclidian and Jaccard methods with the same experimental
and had 10 classifications while the K-Series contained 400 conditions is shown in Figure 5. It is revealed vividly that
web pages in 20 categories. 800 pages were randomly the graph theoretic and genetic algorithm-based technique
selected from the original 2100 pages and the number of outperforms other techniques especially with increased
vectors was set to 40, which doubled the number of graph nodes. In other words, as the complexity of the web
categories in the K-Series experiment. The choice of this contents increases, other reviewed techniques could not
number of clusters is premised on the fact that it is the most match up with the proposed technique in the area of mutual
natural number based on the initial tests and observations. information index.
The number of maximum nodes per graph was set to be
higher to provide improved baseline for the results as 5. Conclusion
shown in Table 2.
With standard tools for web content mining, there is
Table 2: Performance of Graphs with increasing nodes opportunity for extracting only the relevant text from web
Max. Nodes/Graph AM (average) while unrelated textual noise like advertisements,
150 0.2218 navigational elements, contact and copyright notes are
120 0.2142 reliably suppressed. The reported research hybridized graph
90 0.2074
theoretic and genetic algorithm to formulate a web content
75 0.2045
mining technique for achieving this purpose. The new
60 0.1865
45 0.1758 technique provides timely search and discovery from large
30 0.1617 web datasets and experimental results had shown its
15 0.1540 superiority over other techniques. These suggest the new
5 0.1326 technique will be very useful in areas where knowledge
Each row in Table 2 provides results for 10 experiments discovery, web structure and web analytics are required. It
using the same data sample of 800. The variation in the is of note that the applicability of the new technique on
results is due to randomization in the first stage of the complex and large number of parameters has not been
algorithm. Previously obtained data were represented in the investigated.
graph for better visualization and with a 2.2 GHz processor,
it took 7 minutes to represent five nodes per graph.
Euclidian distance, for point (x,y) was also determined
with a view to measuring the vector distance metrics as
follows:
, ∑ (18)
xi and yi are the ith components of the x and y vectors

respectively. The cosine equivalent, of the distance is
obtained as follows:
∗
, 1 | |.|| ||
(19)
∗ is the dot operator, and || shows the magnitude of the

vector being considered. Comparison of the results obtained
from the graphs with those from other techniques is Figure 4.3: Comparison of graph with other techniques
presented in Table 3.
REFERENCES
Table 3: Comparison of Graph theoretic approach with
other techniques at an instance [1] Liu B., Structured data extraction: Wrapper generation and
Method AM (average) Web Data Mining, Editorial Issues on Web content Mining.
SIGKDD Explorations –Vol. 6, 2005, pp. 363 - 423
Graphs (current study) 0.222
[2] Abdelhakim H., Khentout C. and Djoudi M., Overview of
Extended Jaccard Similarity [34] 0.184 Web Content Mining Tools, The International Journal of
Pearson Correlation [34] 0.178 Engineering and Science (IJES), Vol. 2, 2013.
Cosine Measure [31] 0.178 [3] Marghny M. H. and Ali A. F., Web mining based on genetic
Random [31] 0.066 algorithm, Proceedings of AIML ‘05 Conference, Cairo,
Euclidean [31] 0.046 Egypt, 2005, pp 19-21
[4] Ammar S. A., Enhancing recall and precision of web-Search
using genetic algorithm, A thesis submitted for the degree of
Since larger graphs hold more data, the mutual information Doctor of Philosophy, School of Information Systems
is seen to increase as the graph increases in size. The Computing and Mathematics, Brunel University, UK, 2012
random baseline was used to provide a basis for [5] Zaiane R. O., Introduction to Data Mining: Principles of
comparison in the experiment. The Jaccard means was knowledge discovery in databases, 1999.
[6] Han J. and Kamber M., Data mining: concepts and
based on the Jaccard similarity and the cosine and Pearson
techniques, Morgan Kaufmann Publishers Inc., San
measures were omitted for improved clarity. The graphical Francisco, CA, USA, 2000

[7] Imielinski T. and Mannila H., A database perspective on extraction methodology, Companion, Vol. 10, 2012, pp 267-
knowledge discovery, Communications of ACM, Vol. 39, 270.
pp. 58-64. [30] Chen H., Chau M. and Zeng D., Spider: a tool for
[8] Rosenfeld L. and Morville P., Information architecture for competitive intelligence on the web, Decision Support
the World Wide Web, 1st edition, CA, 1998 System, Vol. 17, No. 34, 2002
[9] Callan J., System and method for filtering a document [31] Marcov A. Last M. and Kandel A., Model-Based
strem”, US Press, Patient 6.105.023, 2000 Classification of Web Documents Represented by Graphs,
[10] Chen M. S., Han J. and Yu P. S., Data mining: An overview Proceedings of WEBKDD'06, Philadelphia, Pennsylvania,
from a database perspective, IEEE Trans. Knowledge and USA, 2006
Data Engineering, Vol. 8, 1996, pp 866-883 [32] Cormen T. H., Leiserson C. E., Rivest R. L. and Stein C.,
[11] New York Stock Exchange, 2000. Available at The algorithms of Kruskal and Prim: Introduction to
http://www.ecgi.org/codes/documents/nyse_cgreport_23sep Algorithms”, 3rd edition. MIT Press, Vol. 23, No. 2, 2009,
2010_en.pdf. Accessed October 12, 2013. pp. 631-638.
[12] Piatetsky-Shapiro G., Fayyad U. M. and Smyth P., From [33] Artymiuk P. J., Spriggs R. V. and Willett P., Graph theoretic
data mining to knowledge discovery, An overview. In U.M. methods for the analysis of structural relationships in
Fayyad, et al. (eds.), Advances in Knowledge Discovery and biological macromolecules, Journal of the American Society
Data Mining, AAAI/MIT Press, 1996, pp. 1-35 for Information Science and Technology, Volume 56, 2005,
[13] Sivaramakrishnan J. and Balakrishnan V., Web Mining pp 518 – 528
Functions in an Academic Search Application, Informatica [34] Schenker A., Last M., Bunke H., and Kandel A., Graph
Economica, Vol. 13, No. 3, 2009. Theoretic Techniques for Web Content Mining, PhD Thesis,
[14] Poonkuzhali G., Sarukesi K. and Uma G. V., Web Content College of Engineering, University of South Florida, 2003
Outlier Mining Through Mathematical Approach and Trust [35] Shoreh A. and Mohammad D. J., Deep Web Content
Rating, Recent Researches in Applied Computer and Mining, The Journal of World Academy of Science,
Applied Computational Science, 2012. Engineering and Technology, 2009, pp. 49.
[15] Silltow J., Data Mining 101: Tools and Techniques”, paper [36] Sighn A., Agent Based Framework for Semantic Web
presented at The Institute of Internal Auditors (IIA), 247 Content Mining, International Journal of Advancements in
Maitland Avenue, Altamonte Springs, Florida U.S.A., 2006 Technology (IJoAT), Vol. 3, No. 2, 2012. Available at
[16] Colet E., Clustering and Classification: Data Mining http://ijict.org/. Accessed September, 2014.
Approaches”. Virtual Gold Incorporated, 2002 [37] Kaushik M. and Phukon S., A composite Graph Model for
[17] Galeas P., Web Mining, 2005. Available at: Web Document and the MCS Technique, International
http://www.galeas.de/webmining.html, Accessed January, Journal of Information Technology and Knowledge
2016. Management, Vol. 4, No.1, 2012, pp. 211-215
[18] Cooley R., Mobasher B. and Srivastava J., Web mining: [38] Zhenyu W. and Richard L., An Optimal Graph Theoretic
information and pattern discovery on the World Wide Web. Approach to data Clustering, theory and its application to
Proceedings of 9th IEEE International Conference, pp. 558 – Image Segmentation”, IEEE Transactions on Pattern
567, 1997 Analysis and Machine Intelligence, Volume 15, No. 11,
[19] Arvind K. S. and Gupta P. C., Exploration of efficient 1993.
methodologies for the improvement in web mining [39] Andrey A. M., A Generalized Graph-Theoretic Mesh
techniques - A survey, International Journal of Research in Optimization Model, Proceedings of the 26th International
IT & Management, Vol. 1, No. 3, 2011 Meshing Roundtable, South Lake Tahoe, 2005
[20] Arvind K. S. and Gupta P. C., Study and Analysis of Web [40] Wallis W. D., Shoubridge P., Kraetzl M. and Ray D., Graph
Content Mining Tools to Improve Techniques of Web Data distances using graph union. Pattern Recognition Letters,
Mining, International Journal of Advanced Research in Vol. 22, No. 6, 2001, pp 701-704.
Computer Engineering and Technology (IJARCET), Vol. 1 [41] Bunke H., Recent Development in graph matching.
,No. 8, 2012 Proceedings of 15th International Conference on Pattern
[21] Baumgartner R., Gatterbauer W. and Gottlob G., Web data Recognition, Vol. 2, 2000, pp 117-124.
extraction system: Encyclopedia of Database Systems, [42] Mirtha-Lina F. and Gabriel V., A graph distance metric
2009, pp 3465-3471. combining maximum common subg--raph and minimum
[22] Kosala R. and Blockeel H., Web mining research: A common super-graph, Pattern Recognition Letters, Vol. 22,
survey,” SIGKDD Explorations: Newsletter of the Special 2001, pp 753-758.
Interest Group on Knowledge Discovery and Data [43] Sandhya Chaturvedi M. and Shrotriya A., Graph Theoretic
(SIGKDD) Mining, ACM, Vol. 2, 2000 Techniques for Web Content Mining. The International
[23] Gore M. M. and Mishra A. K., Algorithm for Data Mining. Journal of Engineering and Science, Vol. 2, No. 7, 2013, pp.
Proceedings of Winter School on Data Mining, Allahabad, 35-41.
India, 2001 [44] Vikrant S. and Thakur R. S., GA Based Model for Web
[24] Ferrara E., De-Meob P., Fiumarac, G. and Baumgartnerd, Content Mining”, International Journal of Computer
R., Web Data Extraction, Applications and Science Issues (IJCSI), Vol. 10, No. 2, No 3, 2013
Techniques: A Survey, 2014. Available at [45] Ammar S. A. and Shaker R., Genetic Algorithm Mining for
http://www.sciencedirect.com/science/article/pii/S09507051 HTML Documents, School of Information Systems
14002640, Accessed on September, 2015. Computing and Mathematics (SISCM), Brunel University,
[25] Irmak U. and Suel T., Interactive wrapper generation with UK, 2009
minimal user effort”. Proceeding of 15th International
Conference on World Wide Web, Edinburgh, Scotland,
2006, pp 553-563
[26] Wang P., Hawk W. and Tenopir C., Users' interaction with
World Wide Web resources: an exploratory study using a
holistic approach, Information Processing Management,
2000, pp 229 - 251,
[27] Furche T., Gottlob G., Grasso G., Gunes O., Guo X.,
Kravchenko A., Orsi G., Schallhart C., Sellers A. J. and
Wang C., Domain-centric, intelligent, automated data

Ijcsi 13 6 68 75

Uploaded by

Copyright:

Available Formats

Ijcsi 13 6 68 75

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ijcsi 13 6 68 75

Uploaded by

Copyright:

Available Formats

IJCSI International Journal of Computer Science Issues, Volume 13, Issue 6, November 2016

ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784

Graph Theoretic and Genetic Algorithm-Based Model

2Department of Computer Science, Federal University of Technology,

Abstract taken to mine useful information. These challenges

2016 International Journal of Computer Science Issues

Web Content Web Structure Web Usage

2016 International Journal of Computer Science Issues

2016 International Journal of Computer Science Issues

The proposed system combines the strength of existing

2016 International Journal of Computer Science Issues

2016 International Journal of Computer Science Issues

xi and yi are the ith components of the x and y vectors

∗ is the dot operator, and || shows the magnitude of the

2016 International Journal of Computer Science Issues

2016 International Journal of Computer Science Issues

You might also like