Mathematical and Computational Methods in Science and Engineering
Bio-Inspired Fuzzy Expert system for Mining Big data
Mr. A. SENTHIL KARTHICK KUMAR 1, Dr. M. THANGAMANI 2 & Dr. A.M.J MOHAMED ZUBAIR RAHMAN 3
1 Assistant Professor, Department of Computer Applications, Nehru Institute of Information Technology & Management, Coimbatore-641 105, Tamilnadu, India.
2 Assistant Professor, Department of Computer Science, Kongu Engineering College, Perundurai-638 052, Erode District, Tamilnadu, India.
3 Principal, Al-Ameen Engineering College, Erode, Tamilnadu, India. Pin code-638104
Abstract: - The growing number of documents worldwide makes it increasingly difficult to classify them according to users' needs. The availability of large quantities of text documents from the World Wide Web and from business document management systems has made the dynamic separation of texts into new categories a very important task for every business intelligence system. However, present text clustering algorithms still suffer from problems of practical applicability. Recent studies have shown that ontologies are useful for improving the performance of document clustering. An expert system (ontology) is the conceptualization of a domain in a human-understandable yet machine-readable format comprising entities, attributes, relationships and axioms. After analyzing the available techniques for document clustering, a clustering technique based on the Genetic Algorithm (GA) is determined to be better. The evaluation results show that the proposed approach is effective in clustering documents in a distributed environment.
Key-Words: - Fuzzy ontology, clustering, distributed clustering, Peer to peer network.
1. Introduction
In this paper, ontology [1] is introduced as a modeling technology for structured metadata definition within a document clustering system. Documents can be clustered with the metadata obtained using Genetic Algorithms (GAs) [2]. GA is a search technique based on natural genetics and selection, combining survival of the fittest with structured interchanges. It conserves the attributes of the fittest members of a generation for use in the next generation, while introducing variation into the composition of the new generation through crossover and mutation functions. GA [3] is a well-known technique for handling complex search problems by implementing an evolutionary stochastic search, and it can be applied very effectively to various challenging optimization problems.
Genetic Algorithm (GA) is considered to provide better clustering results. However, GA needs a longer convergence time and more iterations than other techniques. To improve performance, in this paper fuzzy ontology is applied to the database before GA is used, in order to reduce the convergence time and the number of iterations. Fuzzy ontology provides better classification of a large, vague database. This has motivated the use of fuzzy ontology together with GA for clustering. Hence, in this approach, fuzzy ontology is combined with GA to yield better classification accuracy for large databases.
ISBN: 978-960-474-372-8
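The GA loop described in the introduction (keeping the fittest individuals of a generation and introducing variation through crossover and mutation) can be illustrated with a minimal sketch. It is not the implementation used in this paper: the bit-string encoding, the tournament selection, the parameter values and the toy fitness function (the number of ones in the string) are all assumptions chosen only to keep the example short.

```python
import random


def evolve(fitness, genome_len, pop_size=20, generations=40,
           mutation_rate=0.05, elite=2, seed=0):
    """Toy generational GA over bit strings (illustrative assumptions only)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(pop[0]))
        # Elitism: the fittest individuals are carried over unchanged.
        next_pop = [ind[:] for ind in pop[:elite]]
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # One-point crossover.
            cut = rng.randint(1, genome_len - 1)
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation introduces variation.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    pop.sort(key=fitness, reverse=True)
    best_per_generation.append(fitness(pop[0]))
    return pop[0], best_per_generation


# Hypothetical toy problem: maximize the number of ones in a 16-bit string.
best, history = evolve(fitness=sum, genome_len=16)
```

Because the elite individuals are copied unchanged, the best fitness recorded in `history` never decreases from one generation to the next. For document clustering, the chromosome and the fitness function would have to encode a clustering instead of a bit string.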
2. Related Works
Lena Tenenboim et al. [4] proposed ontology-based classification for document clustering. The authors recommended classification of news items in ePaper, a prototype system of a future personalized newspaper service on a mobile reading device. The ePaper system aggregates news items from different news suppliers and distributes to each subscribed user a personalized electronic newspaper, making use of content-based and collaborative filtering techniques. As the classical Euclidean distance metric cannot create a suitable separation for data lying on a manifold, a GA-based clustering method using a geodesic distance measure was proposed by Gang Li et al. [5]. In the proposed method, a prototype-based genetic representation is used, where every chromosome is a sequence of positive integers that indicate the k-medoids.
Casillas et al. [6] also put forth a method of document clustering using GA. It computes an approximation of the optimum k value and finds the best clustering of the documents into k clusters. It was tested on sets of documents returned by a query to a search engine. Andreas Hotho et al. [7] advocated a text data clustering technique. Text clustering usually involves clustering in a high-dimensional space, which appears difficult with regard to virtually all practical settings. Additionally, a detailed clustering outcome is provided.
A word-sets-based document clustering algorithm for large datasets was proposed by Sharma et al. [8]. Document clustering is a significant tool for use in search engines and document browsers. It gives the user a better overall view of the data available in the documents. There is also a strong requirement for hierarchical document clustering [9], where clustered documents can be browsed by increasing specificity of topics. Frequent Itemset Hierarchical Clustering (FIHC) is used for hierarchical grouping of text documents. This technique does not provide consistent clustering results when the number of frequent sets of terms is large. In their paper, the authors proposed Wordsets-based Clustering (WDC), an efficient clustering technique based on closed word sets. WDC uses a hierarchical technique to cluster text documents having common words.
Cao et al. [10] provided fuzzy named entity-based document clustering. Conventional keyword-based document clustering methods have restrictions because of their simple treatment of words and rigid partition of clusters. Named entities, which are important elements in defining document semantics and in many cases are of concern to users, are introduced as objectives into fuzzy document clustering. Zhang et al. [11] gave clustering aggregation based on GA for document clustering. A GA-based technique for the clustering aggregation problem, named GeneticCA, is provided. To estimate the clustering performance of a clustering division, clustering precision is defined and its features are considered.
Web document clustering using a document index graph was put forth by Momin et al. [12]. Document clustering methods are generally based on single-term analysis of the document data set. To attain more precise document clustering, more informative features such as phrases are essential.
Muflikhah et al. [13] proposed document clustering based on concept space and cosine similarity measurement. This technique aims at incorporating information retrieval and document clustering into a concept space approach. It is known as Latent Semantic Indexing (LSI) and uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Its objective is to reduce the matrix dimension by identifying patterns in the document collection with reference to the terms. An affinity-based similarity measure for Web document clustering is presented by Shyu et al. [14]. Document clustering is extended to Web document clustering by establishing an affinity-based similarity measure, which makes use of user access patterns to find similarities among Web documents through a probabilistic model. Various experiments are conducted for evaluation with a real data set. The experimental results show that the proposed similarity measure outperforms the cosine coefficient and the Euclidean distance under various document clustering techniques.
ELdesoky et al. [15] gave a similarity measure for document clustering based on topic phrases. In the conventional vector space model (VSM), researchers have used the unique words in the document set as candidate features. Currently, phrase-based informative features are considered because they contribute to enhancing document clustering accuracy and effectiveness. The similarity measure of the traditional VSM is evaluated by considering the topic phrases of the document as the comprising terms instead of the conventional terms. Thangamani et al. examined document clustering [16, 17] in individual and peer-to-peer environments and also developed a system for automatic extraction and classification of documents using a multi-domain ontology.
A document clustering method based on a hierarchical algorithm with model clustering is presented by Haojun Sun et al. [18]. It analyzes and makes use of cluster overlapping to design a cluster merging criterion. Document clustering with the fuzzy c-means algorithm is proposed by Thaung Win et al. [19]. Most traditional clustering techniques allocate each data item to exactly one cluster, thereby creating a crisp partition of the data. Fuzzy clustering, however, permits degrees of membership with which data fit into various clusters. Xindong Wu et al. [20] investigated how data mining techniques can be applied to big data sets for information extraction.
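To make the contrast between crisp and fuzzy clustering in the survey above concrete, the following sketch computes degrees of membership of one point in several clusters. It uses the textbook fuzzy c-means membership rule with fuzzifier m; it is not taken from [19], and the function name, the two cluster centers and the sample point are hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Membership degrees of one point in each cluster (textbook FCM rule)."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


# Hypothetical data: two cluster centers and one point between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships((1.0, 0.0), centers)  # approximately [0.9, 0.1]
```

A crisp method would assign the point only to the first cluster. Here it keeps a small membership in the second cluster as well, and the memberships sum to 1.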
3. Expert system for distributed textual clustering
Initially, ontology generation using fuzzy logic is applied to the database containing a large number of documents. This technique generates the ontology for the given database. With this ontology, the next step is the application of GA. GA is used for clustering the documents in the database with the help of the ontology generated by the fuzzy logic technique. The combination of expert system generation using fuzzy logic and GA helps to increase the accuracy of clustering. It consists of the following modules.
Formal concept analysis using fuzzy logic: Fuzzy formal concept analysis integrates fuzzy logic into formal concept analysis to represent vague data. The fuzzy formal context shown in Table 1 consists of three objects, which denote three documents, named D1, D2 and D3. Moreover, it has three attributes, Data Mining, Clustering and Fuzzy Logic, indicating three topics. A membership value between 0 and 1 denotes the relationship between an object and an attribute. To remove the relationships that have low membership values, a confidence threshold T is introduced. Table 2 represents the fuzzy formal context of Table 1 with confidence threshold T = 0.5. Usually, the attributes of a formal concept can be considered as the description of the concept. Thus, the relationship between an object and the concept must be the intersection of the relationships between the object and the attributes of the concept. Each relationship between an object and an attribute is denoted by a membership value in the fuzzy formal context. Then, based on fuzzy theory, the intersection of these membership values is their minimum. Figure 1 shows the automatic generation of the expert system for analyzing distributed textual clustering.
Table 1 Fuzzy formal context
      Data Mining   Clustering   Fuzzy Logic
D1    0.75          0.3          0.5
D2    1             0.75         0.25
D3    0.25          0.25         0.75
Table 2 Fuzzy formal context for Table 1 with T = 0.5
      Data Mining   Clustering   Fuzzy Logic
D1    0.75          -            0.5
D2    1             0.75         -
D3    -             -            0.75
Expert system generation: Although the formal concepts are generated mathematically, distinct formal concepts are created on the basis of differences in the attribute values of objects in the traditional concept lattice. Clustering such concepts produces the effect of concepts as interpreted by humans. Based on this observation, formal concepts are clustered into conceptual clusters using fuzzy conceptual clustering.
Class Mapping: In this process, the extent and intent of the fuzzy context are mapped into the extent and intent classes of the ontology. It requires supervised training to name the label for the extent class. Keyword attributes can be represented by appropriate names, which are also used to label the intent classes.
Taxonomy relation generation: With the concept hierarchy in place, this phase produces the intent classes of the ontology as a hierarchy of classes. The step can be considered an isomorphic mapping from the concept hierarchy into the taxonomy classes of the ontology.
Non-taxonomy relation generation: This step generates the relations between the extent class and the intent classes, with no hierarchy between classes. The generated relation denotes an equivalent class with no subclass or superclass.
Instances generation: In this process, instances of the extent class are generated. Each instance corresponds to an object in the initial fuzzy context. Depending on the data in the fuzzy concept hierarchy, the attributes of the instances are automatically filled with suitable values. For example, each instance of the class Document, which corresponds to an actual document, is associated with the appropriate research areas.
After the ontology is generated, GA is used to cluster the documents. The ontology helps in determining the best classification for clustering using GA.
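The thresholding of the fuzzy formal context and the minimum-based intersection described in the formal concept analysis module can be sketched as follows. The membership values are those of Table 1. The function names are hypothetical. The sketch keeps values greater than or equal to T because Table 2 retains the value 0.5 at T = 0.5, and it treats a removed relationship as membership 0, which is an assumption.

```python
# Membership values of Table 1 (rows: documents, columns: topics).
TABLE_1 = {
    "D1": {"Data Mining": 0.75, "Clustering": 0.3, "Fuzzy Logic": 0.5},
    "D2": {"Data Mining": 1.0, "Clustering": 0.75, "Fuzzy Logic": 0.25},
    "D3": {"Data Mining": 0.25, "Clustering": 0.25, "Fuzzy Logic": 0.75},
}


def apply_threshold(context, t):
    """Keep only the relationships whose membership value is at least t."""
    return {obj: {attr: mu for attr, mu in row.items() if mu >= t}
            for obj, row in context.items()}


def concept_membership(context, obj, attributes):
    """Fuzzy intersection (minimum) of an object's memberships in the attributes.

    A relationship removed by the threshold counts as membership 0.0
    (an assumption of this sketch).
    """
    row = context[obj]
    return min((row.get(a, 0.0) for a in attributes), default=0.0)


table_2 = apply_threshold(TABLE_1, t=0.5)
# Membership of D2 in a concept described by Data Mining and Clustering:
mu_d2 = concept_membership(table_2, "D2", ["Data Mining", "Clustering"])  # 0.75
```

Applying the threshold to Table 1 reproduces the entries of Table 2, and the membership of D2 in the concept is min(1, 0.75) = 0.75.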
[Figure 1: flow diagram. Stages: Class Mapping, Taxonomy Relation Generation, Non-Taxonomy Relation Generation, Instance Generation. Inputs and outputs: Fuzzy Concept Lattice, Hierarchical Pattern of Conceptual Clusters, Ontology Extent & Intent Classes, Ontology Hierarchical Classes, Ontology Relation Classes.]
Figure.1: Expert system for distributed document clustering
4. Experiment discussions
This technique avoids getting stuck in a local maximum from which the global maximum cannot be reached. This is one of the main benefits of GA compared with conventional search techniques such as the gradient technique. Another advantage is the utility of GA for real-time applications, in spite of its inability to guarantee the optimal solution to the problem. However, it provides a nearly optimal solution in a shorter time, even for complex problems.
5. Conclusion and future work
In this work, an expert system is created with the help of a bio-inspired method for large data. In further investigation, the expert system is to be applied to different datasets to test how effective the system is. The system can also be applied in a cloud environment with e-Learning activities.
References:
[1] Andreas Hotho, Alexander Maedche and Steffen Staab, "Ontology-based Text Document Clustering".
[2] Banerjee, A. and Louis, S.J., "A Recursive Clustering Methodology using a Genetic Algorithm", IEEE Congress on Evolutionary Computation, 2007, Pp. 2165-2172.
[3] Murthy, C.A. and Chowdhury, N., "In Search of Optimal Clusters using Genetic Algorithms", Pattern Recognition Letters, 1996, Pp. 825-832.
[4] Lena Tenenboim, Bracha Shapira and Peretz Shoval, "Ontology-Based Classification of News in an Electronic Newspaper", International Book Series Information Science and Computing, 2008, Pp. 89-98.
[5] Gang Li, Jian Zhuang, Hongning Hou and Dehong Yu, "A Genetic Algorithm based Clustering using Geodesic Distance Measure", IEEE International Conference on Intelligent Computing and Intelligent Systems, 2009, Pp. 274-278.
[6] Casillas, A., Gonzalez de Lena, M.T. and Martínez, R., "Document Clustering into an Unknown Number of Clusters Using a Genetic Algorithm", Lecture Notes in Computer Science, Vol. 2807, 2003, Pp. 43-49.
[7] Andreas Hotho, Alexander Maedche and Steffen Staab, "Ontology-based Text Document Clustering", Journal on Kunstliche Intelligenz, Vol. 4, 2002, Pp. 48-54.
[8] Sharma, A. and Dhir, R., "A Wordsets based Document Clustering Algorithm for Large datasets", Proceeding of International Conference on Methods and Models in Computer Science, 2009.
[9] Koller, D. and Sahami, M., "Hierarchically Classifying Documents using Very Few Words", Proceedings of the 14th International Conference on Machine Learning (ML), 1997, Pp. 170-178.
[10] Cao, T.H., Do, H.T., Hong, D.T. and Quan, T.T., "Fuzzy Named Entity-Based Document Clustering", IEEE International Conference on Fuzzy Systems, 2008, Pp. 2028-2034.
[11] Zhenya Zhang, Hongmei Cheng, Shuguang Zhang, Wanli Chen and Qiansheng Fang, "Clustering Aggregation based on Genetic Algorithm for Documents Clustering", IEEE Congress on Evolutionary Computation, 2008, Pp. 3156-3161.
[12] Momin, B.F., Kulkarni, P.J. and Chaudhari, A., "Web Document Clustering Using Document Index Graph", International Conference on Advanced Computing and Communications, 2006, Pp. 32-37.
[13] Muflikhah, L. and Baharudin, B., "Document Clustering Using Concept Space and Cosine Similarity Measurement", International Conference on Computer Technology and Development, Vol. 1, 2009, Pp. 58-62.
[14] Shyu, M.L., Chen, S.C., Chen, M. and Rubin, S.H., "Affinity-based similarity measure for Web document clustering", IEEE International Conference on Information Reuse and Integration, 2004, Pp. 247-252.
[15] ELdesoky, A.E., Saleh, M. and Sakr, N.A., "Novel Similarity Measure for Document Clustering based on Topic Phrases", International Conference on Networking and Media Convergence, 2009, Pp. 92-96.
[16] Thangamani, M. and Thangaraj, P., "Survey on Text Document Clustering", International Journal of Computer Science and Information Security, Vol. 8(4), 2010.
[17] Thangamani, M. and Thangaraj, P., "Effective fuzzy semantic clustering scheme for decentralized network through multidomain ontology model", International Journal of Metadata, Semantics and Ontologies, Interscience, Vol. 7, Issue 2, 2012, Pp. 131-139.
[18] Haojun Sun, Zhihui Liu and Lingjun Kong, "A Document Clustering Method Based on Hierarchical Algorithm with Model Clustering", 22nd International Conference on Advanced Information Networking and Applications, 2008, Pp. 1229-1233.
[19] Thaung Win and Lin Mon, "Document clustering by fuzzy c-mean algorithm", 2nd International Conference on Advanced Computer Control (ICACC), 2010, Pp. 239-242.
[20] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding, "Data Mining with Big Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 1, 2014, Pp. 97-107.
About Authors:
A. Senthil Karthick Kumar is a Research Scholar at Bharathiar University, Coimbatore. He completed his B.Sc in Information Technology at Madurai Kamaraj University in 2003, his MCA at Bharathiar University in 2006, his M.Phil in Computer Science in 2009, and his E.M.B.A in Human Resource Management at MS University in 2012. Prior to joining NIITM he worked for 3 years as a Human Resource Executive (Technical) in various companies such as Perot Systems, Bangalore. Currently he is working as an Assistant Professor at Nehru Institute of Information Technology and Management, Coimbatore, affiliated to Anna University. He is a life member of ISTE and a member of CSI and IAENG. He has published and presented around 10 papers in national and international seminars and journals. His research interests include Cloud computing, Software Engineering, Data mining and E-learning.
Dr. M. Thangamani completed her B.E. at Government College of Technology, Coimbatore, India. She completed her M.E. and PhD (Computer Science and Engineering) at Anna University, Chennai, India. Currently, she is working as Assistant Professor in the Department of Computer Science and Engineering, Kongu Engineering College, Tamil Nadu, India. She has published 20 articles in international journals and presented papers in 46 national and international conferences. She has published 11 books for polytechnic colleges and has also guided many UG projects. She has delivered more than 30 guest lectures in reputed engineering colleges on various topics. She has organized many self-supporting and sponsored national conferences and workshops in the field of Data mining and Cloud computing. She is also a seasonal reviewer for IEEE Transaction on Fuzzy System, International journal of advances in Fuzzy System and Applied mathematics and information journals. She is also an editorial member for many international journals and organizing chair for international conferences in India and other countries. Her research interests include Data mining, Cloud computing, Ontology development, Web Services and Open Source Software.
Dr. A.M.J Mohamed Zubair Rahman is Principal of Al-Ameen Engineering College, Erode. He has 22 years of teaching experience and was awarded his Ph.D by Anna University, Chennai, in 2009. He completed his B.E in Computer Science Engineering at IRTT in 1989, his M.S. in Software Systems at BITS-Pilani in 1996, and his M.E in Computer Science Engineering at Bharathiar University in 2002. He has attended several national and international seminars and presented more than 20 papers in various conferences. He is a life member of ISTE and a member of CSI. He has published and presented around 20 papers in national and international journals. His research interests include Data mining, Network Security, Software Engineering and E-learning.