A Tag - Tree For Retrieval From Multiple Domains of A Publication System
A Tag - Tree For Retrieval From Multiple Domains of A Publication System
A Tag - Tree For Retrieval From Multiple Domains of A Publication System
Web Site: www.ijaiem.org Email: [email protected], [email protected] Volume 3, Issue 2, February 2014 ISSN 2319 - 4847
Associate Professor, Department of Computer Science, Madurai Kamaraj University, Madurai, TN, India. 2 Research Scholar, Department of Computer Science, Madurai Kamaraj University, Madurai, TN, India.
ABSTRACT
It is very difficult for the researchers to collect the literature for their research. The digital libraries of the publication systems are very useful for them. Publication system has set of literatures of various domains. The effectiveness of the Publication system is measured from various aspects; especially indexing and retrieval. This paper shows how effectively the TAG-tree is used for a publication system. The results show that it performs well when compared to other existing techniques.
Keywords: Context-based search, TAG-tree, TAG+- tree, Publication search, Information Retrieval.
1. INTRODUCTION
In the traditional library, people get the help of Information Specialist, who assist others to find information. In this modern technology, the traditional libraries are promoted to digital libraries, which is readily available in our systems through online. The role of the Information Specialist is now played by Information Retrieval Systems (IRS). Thus making the IRS more efficient and potential enough to satisfy the user information need is the major research issue in this field. In general, retrieving the relevant documents from a digital collection is based on matching the character strings rather than retrieving meanings. The user has to select such a powerful string (query) that represent the information need accurately, and return the relevant documents. It is undoubtedly known fact, that all information needs cannot be expressed by a unique character strings. A single word may have different meanings. This may lead to topic diffusion. This indirectly increases the result set with unwanted results. The only solution is making the IRS to treat the query as Context instead of a set of character strings [6]. Contexts are the keywords along with its synonyms. When the IRS treats the query as Context, then it will find the relation between the keywords of the query. Thus making the IRS effective, it has to perform Context-based Search. For an effective retrieval, the concentration needs to be given to the indexing system also. When the collection increases tremendously, we can retrieve the relevant documents, only if it is indexed in such a way that it is easily reached. Earlier the linear approach is used for indexing, which increases the disk access time. Thus the hierarchical approach is useful, which consumes a less amount of disk access and also limits the number of comparisons. There are more researches done and it shows that the tree based algorithms are good. To reach the object within one comparison, Hash based techniques are useful. The hash value of the object is identified based on which the respective bucket of the hash table is found directly. Thus it needs only O(1) time which also reduces the computational cost. The rest of the paper is organized as follows. We discuss related work in Section 2. In Section 3 we describe the general architecture of the system, whereas experimental results are presented in Section 4 and conclude in Section 5 with the future directions.
2. RELATED WORK
There are more researches were done in this field of retrieval. When the growth of the information volume increases, there occurs more issues such as the change in the behaviour of the data and the relationship between the documents, the way of handling it and so on. Thus it stills the thirst area of research. Various data structures were found in the literature used in various steps of retrieval. In [3], kd-tree is used for query matching, which is a binary tree with storage requirements proportional to the number of records to be searched. A record is a term or document represented by a vector of length k. The tree has to traverse until the leaf is reached to get the closet documents. It is time consuming, since it takes more number of comparisons. The distance from the root is maintained for all nodes. For very large document collections, this method of finding nearest neighbours to the query can be computationally expensive. To overcome the issue discussed above, in [4], J+-tree is used, which is a height balanced tree. The difference between B+tree and J+-tree is that, it increases the size of the buckets of the parent to hold more elements with no change in the height of the tree. When the bucket size increases more, it has to compare more number of elements to reach the element in the child. Very frequently indexing is done by considering the characters in the keywords. For example, in [9], Suffix Tree is constructed with the terms based on the character sequence. This is feasible only if the index terms are very small. There are more indexing structures to handle large volumes of data are available in the literatures. For example, R-tree
Page 72
Figure 1 The TAG+ architecture 3.1 TAG+: A modified indexing structure Figure 2 shows the architecture of the new indexing technique TAG+. TAG+ is a combination of hash table and TAG trees. Each bucket of the hash table is connected to a TAG tree, which is filled with the Contexts (keywords and
Page 73
Figure 2 The indexing structure of TAG+-tree As in TAG technique, the Contexts of each domain are in the form of the pattern such as <prefix> <context> <suffix>. From each domain, the key terms are extracted with the help of Subject Experts manually. Only in the leaf nodes of TAGtree contains the publications of its pattern along with a pointer to the list which contains its synonyms. When a query is posted, TAG Retriever will locate the relevant contexts from the TAG+ indexing structure to get the relevant publications. Also it retrieves the publications which are indentified from the parser database. Finally the publications from the stored Usage History, which are relevant to the given query, are all combined together. Based on the scores of these resultant publications, they are ranked. The top most limited resultant publications are returned, as a result set. To retrieve the result from the TAG+-tree, the following algorithm is used. First the hash value of the queried domain is calculated and the relevant domain pointer is identified. Now the query is searched against the TAG-tree of the particular domain. Algorithm: MRetrieval Data Structure: See Figure 2. Input: d domain id q Query <qp , qc, qs> Output: RPset Resultant Publications. Let RPset Resultant Publications id with its scores qp Prefix tuple of the given query qc Context tuple of the given query qs Suffix tuple of the given query mRetrieval(domain_id d, Query q) { hvd = hash(d); //computes the hash value for the queried domain id d ConSynRet(q, hvd dtp); //Retrieval algorithm in TAG [8] return (RPset); }
4. PERFORMANCE STUDY
In this section we describe our experiments to assert the usefulness of TAG+ indexing technique in the task of indexing/retrieval of documents in multi-domain publication system. Many experiments were conducted on this system, but presented only most important results. This system is compared to the earlier model D-tree. These techniques were implemented in JAVA. All experiments reported in this section were conducted on Processor Intel Core i7 @ 2.30GHz with 8 GB RAM and 1 TB hard disk, running Windows 7. The test data was created with a multi-domain publication system. For testing, 20000 full text articles from 20 different domains were collected manually from various publication systems like IEEE, Medline and so on. This collection is queried with some phrases that are common for more than one domain to evaluate the performance of the proposed approach. 4.1 Indexing Time After the pre-processing of the documents, it is indexed with the selected structure. This study focuses on indexing time for the documents by varying the size of the data, shown in figure 3.
Page 74
Figure 3 Time taken for indexing Due to the linear handling approach and the multidimensional property of D-tree, it involves in more number of comparisons for indexing. D-tree uses the approximate matching, which will increase the number of documents as well as computational time for a particular topic, where as our TAG+ technique uses the pattern based approach that consumes less amount of time for indexing. The performances of these two structures were measured and TAG+ technique performs better than D-tree. 4.2 Retrieval Time Figure 4 shows the retrieval time for different context. In this study, the maximum size of the collection is fixed to 20000, and the systems are queried for documents in various domains.
Figure 4 Time elapsed for retrieval The search boundary in a TAG+ method is very minimum compared to the D-tree method. This leads to access of most needed documents from a minimal data set. The retrieval performance of a new method improves much compared to Dtree. 4.3 Relevancy The main issue in searching is that, retrieved results should satisfy the users information need. The size of the collection increases in various domains, it is too difficult to retrieve the exact documents by means of a single query. Figure. 5 & 6 discusses the relevancy measure of these techniques. To test the topic diffusion problem, the queries were chosen in such a way that, the terms involved in the query were common to more than one domain. For this study the entire collection is queried and the results were evaluated manually by the domain experts.
Page 75
References
[1] Frank S. C. Tseng, Wen-Ping Lin, D-Tree: A Multi-Dimensional Indexing Structure for Constructing Document Warehouses, Journal of Information Science and Engineering, XXII, pp. 819-841, 2006. [2] Guttman A, R-trees: a dynamic index structure for spatial searching, In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 47-57, 1984. [3] Hughey M. K, Berry M. W, Improved Query Matching Using kd-trees: A Latent Semantic Indexing Enhancement, Information Retrieval, II (4), pp.287-302, 2000. [4] Luan H, Du XY, Wang S, Prefetching J+-tree: A Cache-Optimized Main Memory Database Index structure, Journal of Computer Science and Technology, XXIV (4), pp. 687-707, 2009. [5] Ooi B. C, Sacks-Davis R, McDonell K. J, Spatial k-d-tree: an indexing mechanism for spatial databases, In Proceedings of the IEEE 11th Annual International Computer Software and Applications Conference (COMPSAC 87), pp. 433-438, 1987. [6] Ratprasatporn N, Po J, Ali Cakmak, Bani-Ahmad S, Ozsoyoglu G, Context-based literature digital collection search, The VLDB Journal, XVIII (1), pp. 277 301, 2009. [7] Sellis T. K, Roussopoulos N, Faloutsos C, The R+-tree: a dynamic index for multi-dimensional objects, In Proceedings of 13th International Conference on Very Large Data Bases (VLDB), pp. 507-518, 1987. [8] Thangaraj M, Gayathri V, A context-based technique using tag-tree for an effective retrieval from a digital literature collection, Journal of Computer Science, IX (11), pp. 1602-1617, 2013. [9] Yuanyuan Tian, Sandeep Tata, Richard A. Hankins, Jignesh M. Patel, Practical methods for constructing suffix trees, The VLDB Journal, XIV (3), pp. 281-299, 2005.
AUTHORS
M Thangaraj received his post-graduate degree in Computer Science from Alagappa University, Karaikudi, M.Tech. degree in Computer Science from Pondicherry University and Ph.D. degree in Computer Science from Madurai Kamaraj University, Madurai, TN, South India in 2006. He is now the ASSOCIATE PROFESSOR of Computer Science Department at Madurai Kamaraj University. He is an active researcher in Web mining, Semantic Web, Information Retrieval and Big-Data Analytics and has published more than 60 papers in Journals and Conference Proceedings.
Page 76
Page 77