Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IMPACT FACTOR: 6.017
IJCSMC, Vol. 7, Issue. 12, December 2018, pg.239 – 248
A Similarity Measure for Documents
Using Clustering Technique
R. Anushya¹, A. Linda Sherin², A. Finny Belwin³, Dr. Antony Selvadoss Thanamani⁴
¹,²,³ Research Scholar, Department of Computer Science, Bharathiar University, India
⁴ Professor and Head, Department of Computer Science, NGM College, Pollachi, India
¹ [email protected]; ² [email protected]; ³ [email protected]; ⁴ [email protected]
Abstract— Text clustering is an important application of data mining. It is concerned with grouping similar text documents together. Text document clustering plays a vital role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. A clustering method must embed the documents in a suitable similarity space. In this paper we compare four popular similarity measures (cosine similarity, Jaccard similarity, Euclidean distance and the correlation coefficient) in conjunction with different vector space representations of documents (Boolean, term frequency and inverse document frequency). Clustering of the documents is performed using k-means, a partitional clustering method applied to the high-dimensional sparse data representing the text documents. Performance is measured against a human-imposed classification into Topic and Place categories. We conducted a number of experiments and used the entropy measure to ensure the statistical significance of the results. Cosine similarity, the Pearson correlation coefficient and the Jaccard coefficient emerge as the best measures for capturing human categorization behaviour, while the Euclidean distance performs poorly.
Keywords— Clustering, Jaccard similarity, Cosine similarity, Euclidean distance, Correlation coefficient, k-means.
I. INTRODUCTION
Today, with the rapid advances in technology, we can accumulate huge amounts of
data of various kinds. Data mining emerged as a field concerned with the
extraction of useful knowledge from data [1]. Data mining techniques have
been applied to solve a wide range of real-world problems. Clustering is an unsupervised
data mining technique in which the labels of the data objects are unknown. It is the
task of the clustering method to discover the grouping of the data objects under
examination. Clustering can be applied to many kinds of data, including text.
When dealing with textual data, the objects can be documents, paragraphs, or words [2]. Text
clustering refers to the process of grouping similar text documents together. The
problem can be formulated as follows: given a set of documents, partition
them into a number of groups such that documents in the same group are more
similar to each other than to documents in other groups. There are many applications of text
clustering, including document organization and browsing, corpus summarization, and document classification.
Clustering has been proposed for use in browsing a collection of documents [3], in organizing
the results returned by a search engine in response to a user's query [4], and in helping users
quickly identify and focus on the relevant set of results. Customer comments are
clustered in many online stores, such as Amazon.com, to provide collaborative recommendations.
In collaborative bookmarking or tagging, clusters of users that share certain
characteristics are identified by their annotations. Document clustering has also been
used to automatically generate hierarchical groupings of documents [5]. This paper is organized
as follows. Section 2 reviews related work in text document clustering; Section 3
describes the document representation used in the experiments; Section 4 discusses the similarity
measures and their semantics; Section 5 presents the k-means clustering algorithm;
Section 6 explains the experimental setup; Section 7 presents the evaluation measures;
Section 8 reports the results and analysis; and Section 9 concludes and discusses future work.
II. RELATED WORK
Text clustering is one of the important applications of data mining. In this section, we review
some of the related work in this field.
Luo et al. [3] used the concepts of document neighbours and links in order to
improve the performance of k-means and bisecting k-means clustering. Using a pairwise
similarity function and a given similarity threshold, the neighbours of a document are the documents
that are considered similar to it. A link between two documents is the number of common
neighbours. These concepts were used in the selection of initial cluster centroids and in
measuring document similarity.
Many clustering techniques have been proposed in the literature. Clustering algorithms are
mainly categorized into hierarchical and partitioning methods [2, 3, 4, 5]. A
hierarchical clustering method works by grouping data objects into a tree of clusters [6].
These methods can further be classified into agglomerative and divisive hierarchical
clustering, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion. K-means and its variants [7, 8, 9] are the most well-known partitioning methods
[10].
Bide and Shedge [8] proposed a clustering pipeline to improve the performance of k-means
clustering. The authors adopted a divide-and-conquer approach to cluster documents in
the 20 Newsgroups dataset: documents were divided into groups, and preprocessing,
feature extraction and k-means clustering were applied to each group. Document
similarity was computed using the cosine similarity measure. The proposed
approach achieved better results than standard k-means in terms of
both cluster quality and execution time.
Hierarchical methods produce a nested sequence of partitions, with a single,
all-inclusive cluster at the top and singleton clusters of individual points at the bottom.
Each intermediate level can be viewed as combining two clusters from the next lower level
(or splitting a cluster from the next higher level). The result of a
hierarchical clustering algorithm can be displayed graphically as a tree, called a dendrogram.
In contrast to hierarchical methods, partitional clustering techniques create a one-level (un-nested) partitioning of the data points. If K is the desired number of
clusters, partitional approaches typically find all K clusters at once.
Contrast this with traditional hierarchical schemes, which split a cluster to obtain two clusters or
merge two clusters to obtain one. Of course, a hierarchical method can be used
to produce a flat partition of K clusters, and likewise the repeated application of a partitional scheme
can provide a hierarchical clustering.
There are various partitional methods; however, we describe only the k-means
algorithm, which is widely used in document clustering. K-means is based on the idea
that a centre point can represent a cluster. In particular, k-means uses the notion of
a centroid, which is the mean or median point of a group of points. Note that a centroid
almost never corresponds to an actual data point. The algorithm is discussed in detail in Section 5.
III. METHODOLOGY
A. Data Collection
This work experiments with two benchmark datasets: "Reuters-21578 Distribution
1.0" and the Classic dataset, collected from the UCI KDD repositories. The Reuters-21578 collection is
distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents.
B. Document Representation
In order to reduce the complexity of the documents and make them easier to handle,
each document must be transformed from its full-text version to a document vector which
describes its contents. The representation of a set of documents as vectors in
a common vector space is known as the vector space model. In the vector space model of IR,
documents are represented as vectors of features corresponding to the terms that occur within the
collection. This representation is also called a bag of words, since words are assumed to appear
independently and their order is irrelevant. The value of each feature is called the term
weight and is usually a function of the term's frequency (or tf-idf) in the document, along with
other factors.

Vector space representation of a document involves three steps [7]. The first step is document
indexing, where content-bearing terms are extracted from the documents. The second step is to
compute the weights of the indexed terms to enhance the retrieval of documents relevant to the user.
The final step is determining the similarities between the documents.
The vector space model is a common representation of text documents. Let D be a
collection of documents and let $T = \{t_1, \dots, t_n\}$ be the set of terms appearing in D. A
document x can then be represented as an n-dimensional vector in the term space T. Let
$tf(x, t_i)$ be the number of times a term $t_i$ appears in x; the vector of x is then
defined as:

$\vec{x} = \big(tf(x, t_1), tf(x, t_2), \dots, tf(x, t_n)\big)$  (1)
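As an illustration (ours, not part of the original paper), the following Python sketch builds such term-frequency vectors for a toy two-document corpus; the toy documents and the helper name `tf_vector` are assumptions made for the example:

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Term-frequency vector of one tokenized document over the term space T."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

# Toy corpus: two already-tokenized documents.
docs = [["cocoa", "price", "rise"], ["price", "fall", "cocoa", "price"]]
vocabulary = sorted({t for d in docs for t in d})   # term space T
vectors = [tf_vector(d, vocabulary) for d in docs]  # equation (1)
print(vocabulary)  # ['cocoa', 'fall', 'price', 'rise']
print(vectors)     # [[1, 0, 1, 1], [1, 1, 2, 0]]
```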
C. Extracting Index Terms
This involves preprocessing the text documents: tokenizing the text, removing stop
words and applying stemming. Documents in the vector space can then be represented using
Boolean, term frequency (TF) or term frequency-inverse document frequency (TF-IDF) weights.
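A minimal sketch of this preprocessing step (our illustration; the stop-word list is truncated and `crude_stem` is a deliberately rough stand-in for a real stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "it"}  # truncated

def crude_stem(word):
    """Very rough suffix stripping; a real system would use a Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_index_terms(text):
    """Tokenize, remove stop words and stem, returning the index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(extract_index_terms("The prices of cocoa are rising in London markets"))
```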
In the Boolean representation, if a term exists in a document the corresponding term value is
set to one; otherwise it is set to zero. The Boolean representation is used when every term has
equal importance, and is applied when the documents are of small size.
In the term frequency and TF-IDF representations, the term weights
must be set. The simplest choice sets the term weights to the raw frequency counts of the terms in the documents.
This reflects the intuition that terms occurring frequently within a document may reflect its
meaning more strongly than terms occurring less frequently, and should
therefore receive higher weights.
Each document d is considered as a vector in the term space and represented by its term-frequency (TF) vector:

$\vec{d}_{tf} = (tf_1, tf_2, \dots, tf_D)$

where $tf_i$ is the frequency of term i in the document and D is the total number of
distinct terms in the text database.
The second factor is used to give a higher weight to words that occur in only a few
documents. Terms that are restricted to a few documents are useful for discriminating those documents
from the rest of the collection, while terms that occur frequently across the
whole collection are not helpful. The inverse document frequency (IDF) weight is one
way of assigning higher weights to these more discriminative words. IDF is
defined via the fraction $N/n_i$, where N is the total number of documents in
the collection and $n_i$ is the number of documents in which term i occurs.
Consequently, the tf-idf representation of the document d is:

$\vec{d}_{tfidf} = \big(tf_1 \log(N/n_1), tf_2 \log(N/n_2), \dots, tf_D \log(N/n_D)\big)$

To account for documents of different lengths, each document vector is normalized to a unit vector.
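The weighting and normalization can be sketched as follows (our illustration, continuing the term-frequency vectors from the earlier example):

```python
import math

def tfidf_vectors(tf_vectors):
    """Turn raw term-frequency vectors into unit-length tf-idf vectors."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # n_i: the number of documents in which term i occurs.
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(n_terms)]
    result = []
    for v in tf_vectors:
        w = [tf * math.log(n_docs / df[i]) for i, tf in enumerate(v)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0  # normalize to unit length
        result.append([x / norm for x in w])
    return result

print(tfidf_vectors([[1, 0, 1, 1], [1, 1, 2, 0]]))
```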
IV. SIMILARITY MEASURES
There are many metrics for measuring document similarity. We focus on
four measures that are common in this domain: cosine similarity [7], the Jaccard similarity
coefficient, the Euclidean distance and the Pearson correlation coefficient.
A. Cosine Similarity Measure
For document clustering, different similarity measures are available; the most
commonly used is the cosine function. For two documents $\vec{d}_a$ and $\vec{d}_b$, the similarity between them
can be computed as:

$\cos(\vec{d}_a, \vec{d}_b) = \dfrac{\vec{d}_a \cdot \vec{d}_b}{\lVert \vec{d}_a \rVert \, \lVert \vec{d}_b \rVert}$

Since the document vectors are of unit length, the above equation simplifies to
$\cos(\vec{d}_a, \vec{d}_b) = \vec{d}_a \cdot \vec{d}_b$.
When the cosine value is 1 the two documents are identical, and when it is 0 they have
nothing in common (i.e., their document vectors are orthogonal to one another).
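A direct NumPy rendering of the measure (our sketch; the two vectors are arbitrary examples):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.0, 0.2, 0.4, 0.9])
b = np.array([0.1, 0.0, 0.5, 0.8])
print(cosine_similarity(a, b))  # 1.0 for identical directions, 0.0 for orthogonal vectors
```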
B. Jaccard Coefficient
The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient,
measures similarity as the intersection divided by the union of the objects. For text
documents, the Jaccard coefficient compares the total weight of shared terms to the total
weight of terms that are present in either of the two documents but are not shared.
The cosine similarity may be extended to yield the Jaccard coefficient in the case of
binary attributes:

$J(\vec{d}_a, \vec{d}_b) = \dfrac{\vec{d}_a \cdot \vec{d}_b}{\lVert \vec{d}_a \rVert^2 + \lVert \vec{d}_b \rVert^2 - \vec{d}_a \cdot \vec{d}_b}$
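A sketch of the extended (Tanimoto) form for weighted term vectors (ours):

```python
import numpy as np

def extended_jaccard(a, b):
    """Extended Jaccard (Tanimoto) coefficient for weighted document vectors."""
    dot = np.dot(a, b)
    return float(dot / (np.dot(a, a) + np.dot(b, b) - dot))

a = np.array([0.0, 0.2, 0.4, 0.9])
b = np.array([0.1, 0.0, 0.5, 0.8])
print(extended_jaccard(a, b))
```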
D. Euclidean Distance
This is the most common, "natural" and intuitive way of computing a distance between
two samples. It considers the difference between two samples directly, based on
the magnitude of the differences in the feature values. This distance type is typically used for
data sets that are suitably normalized or free of any special distribution
problem. The Euclidean distance between two documents is:

$D_E(\vec{d}_a, \vec{d}_b) = \lVert \vec{d}_a - \vec{d}_b \rVert = \sqrt{\textstyle\sum_{i=1}^{n} (w_{a,i} - w_{b,i})^2}$
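In code (our sketch):

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two document vectors."""
    return float(np.linalg.norm(a - b))

a = np.array([0.0, 0.2, 0.4, 0.9])
b = np.array([0.1, 0.0, 0.5, 0.8])
print(euclidean_distance(a, b))  # 0 for identical vectors; unbounded above
```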
E. Pearson Correlation Coefficient
This distance is based on the Pearson correlation coefficient, which is computed from the
sample values and their standard deviations. The correlation coefficient r takes values
from -1 (large negative correlation) to +1 (large positive correlation). Effectively, the
Pearson distance is computed as

$D_P = 1 - r$

and lies between 0 (when the correlation coefficient is +1, i.e., the two samples are most
similar) and 2 (when the correlation coefficient is -1).
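A sketch of the distance (ours), using NumPy's correlation-matrix helper:

```python
import numpy as np

def pearson_distance(a, b):
    """Pearson distance D_P = 1 - r, ranging from 0 (r = +1) to 2 (r = -1)."""
    r = np.corrcoef(a, b)[0, 1]
    return float(1.0 - r)

a = np.array([0.0, 0.2, 0.4, 0.9])
b = np.array([0.1, 0.0, 0.5, 0.8])
print(pearson_distance(a, b))
```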
V. CLUSTERING ALGORITHM
For the subsequent experiments, the standard k-means algorithm is chosen as the clustering
algorithm. This is an iterative partitional clustering process that aims to minimize the least-squares
error criterion [6]. As mentioned previously, partitional clustering algorithms
have been recognized as better suited for handling large document datasets than
hierarchical ones, owing to their relatively low computational requirements [17, 18, 19].

The standard k-means algorithm works as follows. Given a set of data
objects D and a pre-specified number of clusters k, k data objects are randomly
selected to initialize k clusters, each one being the centroid of a cluster. The remaining objects
are then assigned to the cluster represented by the nearest or most similar centroid.
Next, new centroids are recomputed for each cluster and in turn all documents are re-assigned
based on the new centroids. This step iterates until a converged and fixed
solution is reached, in which all data objects remain in the same cluster after an update
of the centroids. The generated clustering solutions are only locally optimal for the given
data set and the initial seeds; different choices of initial seed sets
can result in very different final partitions. Methods for finding good starting
points have been proposed [20]. However, we will use the basic k-means
algorithm, because improving the clustering is not the main focus of this
paper.
K-means is a well-known and widely applicable clustering algorithm; here we give a
short description of it.
K-means is an iterative clustering algorithm. It partitions the data points
into k clusters using the notion of centroids, where a cluster centroid is the mean value of
the data points within the cluster. The produced partitions exhibit high intra-cluster
similarity and high inter-cluster variation. The number of clusters, k, is a pre-determined parameter
of the algorithm.
K-means works as follows (a minimal implementation sketch is given after the list):
1) k data points are arbitrarily selected as the cluster centroids.
2) The similarity of every data point to each cluster centroid is computed; each
data point is then re-assigned to the cluster of the closest centroid.
3) The k centroids are updated based on the newly assigned data points.
4) Steps 2 and 3 are repeated until convergence is reached.
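A minimal NumPy sketch of these four steps (ours; it uses the Euclidean distance for the assignment step, but any of the measures above could be substituted):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means over an (n, d) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # step 1
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):  # step 4: stop at convergence
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: k-means should recover them.
pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = kmeans(pts, k=2)
```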
VI. EXPERIMENTAL RESULTS
Here, we cluster our datasets using k-means [24]. For each clustering procedure, we
build models using different values of k and the four similarity measures described
above. The RapidMiner platform was used in our experiments. This open-source platform
provides a friendly GUI and supports all the steps of Knowledge Discovery
from Data, including data pre-processing, data mining, model validation,
and result visualization.
A. Dataset
As noted in Section III, this work experiments with two benchmark datasets: "Reuters-21578 Distribution 1.0"
and the Classic dataset, collected from the UCI KDD repositories. The Reuters-21578 collection is distributed
in 22 files; each of the first 21 files (reut2-000.sgm through reut2-020.sgm)
contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The documents are
marked up with SGML tags, and a corresponding SGML DTD was produced, so the boundaries of the
important sections of each document are unambiguous. Every REUTERS tag contains explicit
specifications of the values of attributes such as TOPICS, LEWISSPLIT, CGISPLIT,
OLDID, and NEWID. These attributes are intended to identify documents and groups of
documents, e.g. <TOPICS>, </TOPICS>, <PLACES>, </PLACES>, <BODY>, </BODY>. Each
category value is delimited by the tags <D> and </D>. There are 5 category sets (Exchanges, Organizations,
People, Places and Topics) in the Reuters dataset, and each category set in turn has
subcategories, 672 subcategories in total. We used the TOPICS and PLACES
category sets to form the dataset. The TOPICS category set contains 135 categories and the
PLACES category set contains 175 categories. From these files we collect the actual text
content of each class by extracting the text between <BODY> and
</BODY>, placing it in a text document and labelling it according to its topic and place.
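A hedged sketch of this extraction step (ours; regular expressions suffice because the collection's markup is machine-generated and consistently formed, and the file name below is illustrative):

```python
import re

def tag_values(doc, tag):
    """Return the <D>-delimited category values inside the given tag, if present."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", doc, flags=re.S)
    return re.findall(r"<D>(.*?)</D>", m.group(1)) if m else []

def extract_bodies(sgm_path):
    """Yield (topics, places, body_text) for each document in one .sgm file."""
    with open(sgm_path, encoding="latin-1") as f:  # the files are not valid UTF-8
        raw = f.read()
    for doc in re.findall(r"<REUTERS.*?</REUTERS>", raw, flags=re.S):
        body = re.search(r"<BODY>(.*?)</BODY>", doc, flags=re.S)
        if body:
            yield tag_values(doc, "TOPICS"), tag_values(doc, "PLACES"), body.group(1)

for topics, places, text in extract_bodies("reut2-000.sgm"):
    print(topics, places, text[:60])
    break
```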
The Classic dataset consists of four different collections: CACM, CISI, CRAN and MED.
We considered 800 documents out of the total of 7095 documents.
In these datasets, some of the documents consist of a single word only, so it is pointless to
include such documents in the document dataset. To discard these invalid documents we apply document
reduction to every category, which retains the documents that meet the mean length of
their category. For document reduction we construct the Boolean matrices of all documents
category-wise, compute the mean length of each category, and remove from
the dataset the documents that do not meet the mean length. In this way we obtain valid documents. From these
valid documents we collected 800 documents of four categories each. From Reuters
we considered 200 documents of each category (ACQ and EARN from the TOPICS
category set, and UK and USA from the PLACES category set), totalling 800 documents, and from the
Classic dataset 200 documents of each category, again totalling 800 documents.
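The mean-length filter, as we read the description, amounts to the following sketch (our interpretation; `docs_by_category` maps a category name to its list of tokenized documents):

```python
def reduce_documents(docs_by_category):
    """Keep only documents whose distinct-term count reaches their category's mean."""
    reduced = {}
    for category, docs in docs_by_category.items():
        # The Boolean length of a document is its number of distinct terms.
        lengths = [len(set(doc)) for doc in docs]
        mean_length = sum(lengths) / len(lengths)
        reduced[category] = [d for d, n in zip(docs, lengths) if n >= mean_length]
    return reduced
```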
VII. EVALUATION MEASURES
We use entropy as a measure of the quality of the clusters (with the caveat that the best
entropy is obtained when each cluster contains exactly one data point). Let CS be
a clustering solution. For each cluster, the class distribution of the
data is calculated first, i.e., for cluster j we compute $p_{ij}$, the "probability" that a
member of cluster j belongs to class i. Then, using this class
distribution, the entropy of each cluster j is computed using the standard formula

$E_j = -\sum_i p_{ij} \log p_{ij}$
where the sum is taken over all classes. The total entropy for a
set of clusters is calculated as the sum of the entropies of the clusters, weighted by
the size of each cluster:

$E = \sum_{j=1}^{m} \frac{n_j}{n} E_j$

where $n_j$ is the size of cluster j, m is the number of clusters, and n is the total
number of data points.
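Both formulas translate directly into code (our sketch; clusters are given as lists of the true class labels of their members):

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """Entropy E_j of one cluster from the class labels of its members."""
    n = len(class_labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(class_labels).values())

def total_entropy(clusters):
    """Size-weighted total entropy over a list of clusters."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# Two fairly pure clusters yield a low total entropy.
print(total_entropy([["acq", "acq", "earn"], ["uk", "uk", "uk"]]))
```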
VIII. RESULTS AND DISCUSSION
In this section, we examine the quality of the obtained clustering models based on the
values of the clustering evaluation measures. We compare all the obtained models to
find the best combination of clustering technique and similarity measure.

In this work seed points are chosen statically, although efficiency can be improved if
seeds are selected randomly or the algorithm is run repeatedly. As
shown in Tables 1a and 1b and Tables 2a and 2b, the Euclidean distance performs
worst, while the performance of the other measures is quite similar. From our results it is
observed that only the Boolean representation with the Pearson measure yields no empty clusters.
Consequently, the overall entropy in the Boolean-representation rows shows NaN values for the other
measures, since some of their clusters are empty. On average, the Jaccard and Pearson measures are slightly
better at generating more coherent clusters, meaning clusters with lower entropy
scores. Table 3a shows one partition as generated by the Boolean Pearson measure using the
Reuters dataset, and Table 3b shows one partition as generated by the TF-IDF Jaccard
coefficient measure using the Classic dataset, which has the lowest entropy value.
Representation | Cosine | Jaccard | Euclidean | Pearson
Boolean        | NaN    | NaN     | NaN       | 0.33
Freq. Count    | 0.36   | 0.36    | 0.42      | 0.38
TF-IDF         | 0.36   | 0.38    | 0.42      | 0.37

Table 1a: Entropy results of the different vector space representations using the Reuters dataset
Representation | Cosine | Jaccard | Euclidean | Pearson
Boolean        | NaN    | NaN     | NaN       | 0.06
Freq. Count    | 0.16   | 0.12    | 0.30      | 0.06
TF-IDF         | 0.06   | 0.07    | 0.30      | 0.06

Table 1b: Entropy results of the different vector space representations using the Classic dataset
As above, the Euclidean distance again turns out to be an inadequate metric
for modelling the similarity between documents, while the Jaccard and Pearson coefficients tend
to outperform the cosine similarity.
           | Cosine | Jaccard | Euclidean | Pearson
Cluster[0] | 0.41   | 0.16    | 0.38      | 0.41
Cluster[1] | 0.33   | 0.38    | 0.44      | 0.33
Cluster[2] | 0.26   | 0.40    | 0.40      | 0.28
Cluster[3] | 0.31   | 0.16    | 0.42      | 0.30

Table 2a: TF-IDF per-cluster entropy results using the Reuters dataset
           | Cosine | Jaccard | Euclidean | Pearson
Cluster[0] | 0.05   | 0.01    | 0.30      | 0.01
Cluster[1] | 0.01   | 0.08    | 0.30      | 0.04
Cluster[2] | 0.06   | 0.07    | 0.30      | 0.07
Cluster[3] | 0.13   | 0.11    | 0.00      | 0.10

Table 2b: TF-IDF per-cluster entropy results using the Classic dataset
           | ACQ | EARN | UK  | USA | LABEL
Cluster[0] | 173 | 71   | 64  | 8   | ACQ
Cluster[1] | 18  | 12   | 107 | 57  | UK
Cluster[2] | 8   | 115  | 15  | 12  | EARN
Cluster[3] | 1   | 2    | 14  | 123 | USA

Table 3a: Clustering results from the Boolean Pearson correlation measure using the Reuters dataset
           | CACM | CISI | CRAN | MED | LABEL
Cluster[0] | 0    | 1    | 0    | 166 | MED
Cluster[1] | 8    | 5    | 199  | 30  | CRAN
Cluster[2] | 3    | 166  | 0    | 4   | CISI
Cluster[3] | 189  | 28   | 1    | 0   | CACM

Table 3b: Clustering results from the TF-IDF Jaccard measure using the Classic dataset
We used the clustering accuracy as a measure of the quality of a clustering result. The clustering
accuracy r is defined as

$r = \frac{1}{n} \sum_{i} a_i$

where $a_i$ is the number of instances occurring in both cluster i and its corresponding class, and n
is the number of instances in the dataset. The clustering accuracy is highest for the TF-IDF
representation with the Pearson and Jaccard coefficient measures. The Classic dataset showed
over 94 percent accuracy.
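Taking $a_i$ to be the count of instances shared by cluster i and its corresponding class, the accuracy can be computed from a cluster-by-class contingency table such as Table 3b (our sketch):

```python
def clustering_accuracy(table):
    """Accuracy r = (sum of each cluster's corresponding-class count) / n."""
    n = sum(sum(row) for row in table)
    return sum(max(row) for row in table) / n

# Contingency table from Table 3b (Classic dataset, TF-IDF Jaccard).
table_3b = [
    [0, 1, 0, 166],   # Cluster[0] -> MED
    [8, 5, 199, 30],  # Cluster[1] -> CRAN
    [3, 166, 0, 4],   # Cluster[2] -> CISI
    [189, 28, 1, 0],  # Cluster[3] -> CACM
]
print(clustering_accuracy(table_3b))  # accuracy of this particular partition
```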
IX. CONCLUSION AND FUTURE WORK
In this study we found that all of the measures have a significant effect on partitional
clustering of text documents, with the exception of the Euclidean distance measure.
The Pearson correlation coefficient is slightly better, as the resulting clustering solutions
are more balanced and closer to the manually created categories. The Jaccard and Pearson
coefficient measures find more coherent clusters. Considering the kind of cluster
analysis involved in this study, we can see that there are three components that
affect the final results: the representation of the documents, the distance or similarity measure
used, and the clustering algorithm itself. In our future work our intention is to apply
semantic learning to the document representations, in order to represent relationships among terms, and to
study the effect of these similarity measures on it thoroughly.
REFERENCES
[1] Han, J., Kamber, M. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, 3rd ed., 2011. ISBN 978-0-12-381479-1.
[2] Aggarwal, C.C., Zhai, C. A Survey of Text Clustering Algorithms. In: Aggarwal, C.C., Zhai, C. (eds.), Mining Text Data. Springer US, 2012, pp. 77–128.
[3] Luo, C., Li, Y., Chung, S.M. Text document clustering based on neighbours. Data & Knowledge Engineering, 68(11):1271–1288, 2009.
[4] Hartigan, J.A. Clustering Algorithms. John Wiley & Sons, New York, NY, USA, 1975. ISBN 978-0-471-35645-5.
[5] Elkan, C. Using the Triangle Inequality to Accelerate k-Means. In: Fawcett, T., Mishra, N. (eds.), Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), August 21-24, 2003, Washington, DC, USA. AAAI Press, 2003, pp. 147–153.
[6] Kaufman, L., Rousseeuw, P.J. Clustering by means of Medoids. In: Dodge, Y. (ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland, 1987, pp. 405–416.
[7] Blair, D.C. Review of: Information Retrieval, 2nd ed., by C.J. van Rijsbergen (London: Butterworths, 1979, 208 pp.). Journal of the American Society for Information Science, 30(6):374–375, 1979.
[8] Bide, P., Shedge, R. Improved Document Clustering using k-means algorithm. In: 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2015, pp. 1–5.
[9] Lang, K. 20 Newsgroups data set, 2008 (accessed 2015-12-18). http://www.ai.mit.edu/people/jrennie/20Newsgroups/
[10] van Rijsbergen, C.J. Information Retrieval, 2nd ed. Butterworths, London, 1979.
[11] Malarvizhi, R., Thanamani, A.S. K-Nearest Neighbour in Missing Data Imputation. International Journal of Engineering Research and Development, 5(1), November 2012.
[12] Malarvizhi, R., Thanamani, A.S. K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imputation. International Journal for Research in Science & Advanced Technologies, 1(2), 2013.
[13] Logeshwari, P., Thanamani, A.S. Used Mathematical Models For Finding Multiple Data Imputation In Main Stream. International Journal of Emerging Trends in Science and Technology, 3(5):540–545, May. ISSN 2348-9480.
[14] Chitra, V., Thanamani, A.S. A Survey on Pre-processing Methods for Web Usage Data. (IJCSIS) International Journal of Computer Science and Information Security, 7(3):78–83, 2010.
[15] Sashi, K., Thanamani, A.S. Dynamic replication in a data grid using a modified BHR region based algorithm. Future Generation Computer Systems, 27(2):202–210, 2011.
[16] Malathi Ravindran, R., Thanamani, A.S. K-Means Document Clustering using Vector Space Model. Bonfring International Journal of Data Mining, 5(2):10–14, July 2015.
[17] Umajancy, S., Thanamani, A.S. An Analysis on Text Mining: Text Retrieval and Text Extraction. International Journal of Advanced Research in Computer and Communication Engineering, 2(8), August 2013.
[18] Umajancy, S., Thanamani, A.S. An Analysis on Text Mining: Text Retrieval and Text Extraction. International Journal of Advanced Research in Computer and Communication Engineering, 2(8), August 2013.
[19] Chitraa, V., Thanamani, A.S. A survey on preprocessing methods for web usage data. International Journal of Computer Applications (0975-8887), 34(9), November 2011.
[20] Jothimani, K., Thanamani, A.S. An Algorithm for Mining Frequent Itemsets. IJCSET, 2(3):1012–1015, March 2012.
[21] Kanchana, S., Thanamani, A.S. Boosting the Accuracy of Weak Learner Using Semi Supervised CoGA Techniques. ARPN Journal of Engineering and Applied Sciences, 11(15), August 2016. ISSN 1819-6608.
[22] Priyadharsini, C., Thanamani, A.S. Imputation of Missing Data Using Ensemble Algorithms. International Journal of Modern Computer Science (IJMCS), 5(1):20–23, February 2017. ISSN 2320-7868 (Online).
[23] Priyadharsini, C., Thanamani, A.S. An Improved Novel Index Measured Segmentation Based Imputation Algorithm for Missing Data Imputation. International Journal of Advanced Research in Computer Science and Software Engineering, 7(6):283–286, June 2017. ISSN 2277-128X. DOI: 10.23956/IJARCSSE.
[24] Priyadharsini, C., Thanamani, A.S. A Novel Index Measured Segmentation Based Imputation Algorithm (with Cross Folds) for Missing Data Imputation. International Journal of Electrical Electronics & Computer Science Engineering, 4(3):22–24, June 2017. E-ISSN 2348-2273, P-ISSN 2454-1222.
[25] Ramaraj, M., Thanamani, A.S. Plagiarism detection paradigm for web content using similarity analysis approach. IJAICT, 1(5), September 2014. ISSN 2348-9928. DOI: 01.0401/ijaict.2014.04.01.
[26] Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S. Constrained K-means Clustering with Background Knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577–584.
[27] Johny Antony, P., Thanamani, A.S. A Privacy Preservation Framework for Big Data (Using Differential Privacy and Overlapped Slicing). International Journal of Engineering Research in Computer Science and Engineering (IJERCSE), 3(10), October 2016.
[28] Nandhakumar, R., Thanamani, A.S. A Survey on E-Health Care for Diabetes Using Cloud Framework. International Journal of Advanced Research Trends in Engineering and Technology (IJARTET), 4(10), October 2017.