Papers by Tshering Cigay Dorji
With the tremendous growth in the volume of unstructured textual data on the Internet and company-wide intranets, there is a growing need for automated techniques to “efficiently organize, classify, label, and extract relevant information” (Berry, 2004; Berry & Castellanos, 2008). Merrill Lynch in 1998 cited estimates that as much as 80% of all potentially usable business information originates in unstructured form (Shilakes & Tylman, 1998). Therefore, extracting interesting and non-trivial patterns or knowledge from unstructured data has assumed great importance in fields ranging from business to engineering and biomedical research (Ramakrishnan, 2009). Existing techniques may be broadly divided into those based on a knowledge engineering approach (Hayes & Weinstein, 1990) and those based on a machine learning approach (Pang & Kasabov, 2009; Peng et al., 2008; Graham-Cumming, 2005). Because of the huge amount of time and expertise required to create and maintain knowledge encoding ru...
Every region of Bhutan abounds with rich oral traditions that include folktales, local myths, and legends related to the local history, landforms, and place-names. These oral traditions have been a source of value education as well as entertainment in our traditional rural societies, and they hold the essence of our unique culture and traditions. However, unless we act today, our invaluable oral traditions are in danger of extinction soon due to the sweeping forces of globalization and commercial entertainment that have already reached even remote areas of Bhutan. With the help of examples, this paper provides a brief analysis of the traditional values transmitted by our folktales and the functions served by local legends and myths in Bhutanese society. Finally, this paper offers some practical recommendations for collecting our folktales, myths, and legends in the form of text, audio, and video using the currently available digital technology to create the first comprehensive and d...
International Journal of Computer Applications in Technology, 2015
Popular text classification algorithms such as Naive Bayes, kNN, Centroid-based classifiers and support vector machines (SVM) are based on supervised machine learning. They normally use a classical text representation technique consisting of a 'bag of words' as features. This representation leads to the inclusion of unimportant features and to the loss of important semantic relationships and inflection information, which reduces accuracy. To address this problem, we propose a new text classification methodology based on field association terms, a set of terms that identify specific document fields. The methodology is compared against Naive Bayes, kNN, Centroid-based classifier and SVM on a closed dataset of 3180 documents from Wikipedia dumps and an open dataset of 9449 documents from the Reuters RCV1 Corpus, 20-Newsgroup and 4-Universities datasets. The new method outperformed the other algorithms with a precision of 97%, compared with 85% for Centroid-based, 78% for Naive Bayes, 48% for kNN and 42% for SVM.
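As a minimal illustration of the 'bag of words' representation mentioned in the abstract above, the Python sketch below counts tokens and discards word order. The function name `bag_of_words` and the example sentences are hypothetical and are not taken from the paper; the sketch only shows why two sentences with different meanings can end up with identical features.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count tokens.

    Word order is discarded, which is the property the abstract criticizes.
    (Illustrative helper, not the paper's code.)
    """
    return Counter(text.lower().split())


# Two sentences with opposite meanings (hypothetical examples).
a = bag_of_words("the striker beat the keeper")
b = bag_of_words("the keeper beat the striker")

# Both sentences map to exactly the same feature counts.
same_bag = a == b
print(same_bag)
```

Because only counts survive, a classifier that relies on this representation alone cannot tell the two sentences apart.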
Knowledge and Information Systems, 2010
Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. But the problem lies in the lack of an effective method to extract and select relevant FA Terms to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select and rank
Information Processing & Management, 2008
Information retrieval involves finding some desired information in a store of information or a database. In this paper, co-word analysis is used to rank a selected sample of FA terms. Based on this ranking, search results can be arranged better. Experimental results were obtained using 41 MB of data (7660 documents) in the field of sports. The corpus was collected from CNN newspaper articles in the sports field and is distributed over 11 subfields of sports. In the experiments, the average precision increased by 18.3% after applying the proposed arranging scheme with term weights computed from absolute frequency, and by 17.2% with term weights computed from a formula based on "TF * IDF".
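The two term-weighting options named in the abstract above, absolute frequency and a "TF * IDF"-based formula, can be sketched in Python as follows. The abstract does not give the paper's exact formula, so this sketch assumes the common textbook form tf × log(N / df); the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def absolute_frequency(docs):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def tf_idf(docs):
    """Per-document weights using the assumed form tf * log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


# Toy tokenized documents (hypothetical data).
docs = [["goal", "goal", "match"], ["match", "serve"], ["match", "goal"]]
freq = absolute_frequency(docs)
weights = tf_idf(docs)
print(freq["goal"], weights[0]["goal"], weights[0]["match"])
```

Under this assumed formula, a term that occurs in every document (here "match") receives weight zero, whereas absolute frequency would still rank it highly.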
Information Processing & Management, 2010
Minimal Prefix (MP) double array is an efficient data structure for a trie. However, its space efficiency is degraded by the non-compact management of suffixes. This paper presents three methods to compress the MP double array. The first two methods compress the MP double array by accommodating short suffixes inside the leaf nodes, and pruning leaf nodes corresponding to the end marker symbol. These methods achieve size reduction of up to 20%, making insertion and deletion faster at the same time while maintaining the retrieval time of O(1). The third method eliminates empty spaces in the array that holds suffixes, and improves the maximum size reduction further by about 5% at the cost of increased insertion time. Compared to a Ternary Search Tree, the key retrieval of the compressed MP double array is 50% faster and its size is 3-5 times smaller.
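The minimal-prefix idea behind the structure described above can be sketched with a plain dictionary-based trie in Python: only as many characters as are needed to tell keys apart become trie nodes, and the rest of each key is kept as a suffix string at the leaf. This is not a double-array implementation and does not reproduce the paper's compression methods; the class name `MPTrie` and the end marker `#` are assumptions made for illustration.

```python
END = "#"  # assumed end-marker symbol; keys must not contain it


class MPTrie:
    """Dict-based minimal-prefix trie (illustrative, not a double array)."""

    def __init__(self):
        self.root = {}

    def insert(self, key):
        word = key + END
        node, i = self.root, 0
        while True:
            ch = word[i]
            child = node.get(ch)
            if child is None:
                node[ch] = word[i + 1:]  # new leaf holds the rest as a suffix
                return
            if isinstance(child, str):
                old, new = child, word[i + 1:]
                if old == new:
                    return  # key already stored
                # Split the leaf: share nodes while the two suffixes agree.
                node[ch] = node = {}
                j = 0
                while old[j] == new[j]:
                    node[old[j]] = node = {}
                    j += 1
                node[old[j]] = old[j + 1:]
                node[new[j]] = new[j + 1:]
                return
            node, i = child, i + 1

    def contains(self, key):
        word = key + END
        node = self.root
        for i, ch in enumerate(word):
            child = node.get(ch)
            if child is None:
                return False
            if isinstance(child, str):
                return child == word[i + 1:]  # compare the stored suffix
            node = child
        return False


trie = MPTrie()
for k in ["bachelor", "badge", "baby", "jar"]:
    trie.insert(k)
print(trie.contains("badge"), trie.contains("bab"))
```

In this sketch the key "jar" needs only the single node `j`, with `ar#` stored as its suffix, which is the kind of suffix storage whose compact management the abstract addresses.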