
IndoWordNet Application Programming Interfaces


Neha R Prabhugaonkar (1), Apurva S Nagvenkar (1), Ramdas N Karmali (1)
(1) GOA UNIVERSITY, Taleigao - Goa
[email protected], [email protected], [email protected]

ABSTRACT

Work is currently under way to develop WordNets for various Indian languages. The IndoWordNet Consortium consists of member institutions, each developing its own language WordNet using the expansion approach. Many tools and utilities have been developed by various institutes to help in this process. In this paper, we discuss an object-oriented Application Programming Interface (API) that we have implemented to facilitate easy and rapid development of tools and other software resources that require WordNet access and manipulation functionality. The main objective of the IndoWordNet Application Programming Interface (IWAPI) is to provide access to the WordNet resource independent of the underlying storage technology. The current implementation manipulates data stored in a relational database. Furthermore, the IWAPI supports parallel access and manipulation of WordNets in multiple languages. We discuss the functional requirements, design and implementation of the IndoWordNet API and its uses.

KEYWORDS: WordNet, Application Programming Interface (API), WordNet CMS, IndoWordNet, IndoWordNet Database, WordNet Website.

Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), pages 237–244, COLING 2012, Mumbai, December 2012.

1 Introduction

An Application Programming Interface (API) is a set of commands, functions and protocols that a developer can use when building software. It allows the developer to interact with a system through predefined functions instead of writing them from scratch. APIs are crafted to expose only chosen functionality and/or data while safeguarding the other parts of the application that provides the interface.
The characteristics of a good API (Joshua Bloch, 2007) are as follows:
• Easy to learn and use, hard to misuse.
• Easy to read and maintain code that uses it.
• It is programming language neutral.
• Sufficiently powerful to support all computational requirements.

The IndoWordNet API provides a simple way to access and manipulate the WordNet resource independent of the underlying storage technology. The functionality is exposed through a set of well-defined objects that the developer can create and manipulate according to the processing requirement. Although the current implementation expects the data to be available in a relational database, a two-layered architecture separates the functionality offered to the user from the data access functionality. This allows future enhancements to support any data storage technology and design without changing the API provided to the developer.

The IndoWordNet API allows parallel access and updates to single or multiple language WordNets. A new relational database design (the IndoWordNet database) has been implemented for this purpose. It supports storage of multiple language WordNets and has been optimized to reduce redundancy. Data common across all languages, such as ontology information and semantic relationships, is stored in a separate master database. Data specific to a language, such as synsets and lexical relationships, is stored separately in the database of the respective language.

The rest of the paper is organised as follows: section 2 discusses the functional requirements of IWAPI, section 3 presents the architecture and design of IWAPI, section 4 presents the implementation details, and section 5 presents the conclusion.

2 Functional Requirements

Users often have to rely on others to perform functions that they may not be able or permitted to perform themselves.
Similarly, virtually all software has to request other software to do some things for it. To accomplish this, the asking program uses a set of standardized requests, called application programming interfaces, that have been defined for the program being called upon. Developers make requests by including calls in the code of their applications. The syntax is described in the documentation of the application being called. By providing a means for requesting program services, an API is said to grant access to or open an application.

The following information needs to be maintained for any WordNet resource. Each synset of the language includes a concept definition, usage examples and a set of synonym words (Miller, 1993). Each synset also belongs to a specific lexical class, namely noun, adjective, verb or adverb. An ontology is also maintained, and every synset maps to a specific ontology node in this hierarchy. The synsets are related through semantic relations and words are related through lexical relations. IWAPI should allow the developer to access and manipulate the above information.

Besides the above requirement, it was also felt that it should be possible to maintain additional information about the synsets, e.g. an image describing a concept, pronunciations of words in the synsets, links to websites and other resources. The IWAPI should support storage of and access to such additional information. Depending on the application requirement, the developer may also need to access multiple language WordNets simultaneously. The IWAPI should support this feature as well.

3 Architecture and Design

The IWAPI has a two-layered architecture. The upper layer is the Application layer and the lower layer is the Data layer. The class diagram of the IndoWordNet API (Application layer) is shown in Figure 1. The Application layer exposes the set of classes and methods that the developer uses to access and manipulate the WordNets as discussed in section 2.
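The split between the two layers named above can be illustrated with a minimal Java sketch. It is an illustration under stated assumptions, not the IWAPI implementation: the names `SynsetData`, `SynsetStore`, `InMemorySynsetStore` and `WordNetService` are hypothetical and do not appear in the IndoWordNet API, and an in-memory map stands in for the relational database. The point is only that application-level code depends on an interface, so the storage behind it can be swapped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a two-layer design; none of these types belong to IWAPI.
class TwoLayerSketch {

    // Synset data as listed in section 2: concept, lexical class, synonym words.
    static final class SynsetData {
        final int id;
        final String concept;
        final String lexicalClass;
        final List<String> words = new ArrayList<>();

        SynsetData(int id, String concept, String lexicalClass) {
            this.id = id;
            this.concept = concept;
            this.lexicalClass = lexicalClass;
        }
    }

    // Data layer contract: the only storage-facing type the application layer sees.
    interface SynsetStore {
        int insert(String concept, String lexicalClass);
        SynsetData find(int id);
        int count();
    }

    // One interchangeable data layer; a relational or XML-backed store
    // could implement the same interface (assumption: not shown here).
    static final class InMemorySynsetStore implements SynsetStore {
        private final Map<Integer, SynsetData> rows = new LinkedHashMap<>();
        private int nextId = 1;

        @Override
        public int insert(String concept, String lexicalClass) {
            int id = nextId++;
            rows.put(id, new SynsetData(id, concept, lexicalClass));
            return id;
        }

        @Override
        public SynsetData find(int id) {
            SynsetData row = rows.get(id);
            if (row == null) {
                throw new IllegalArgumentException("No synset with id " + id);
            }
            return row;
        }

        @Override
        public int count() {
            return rows.size();
        }
    }

    // Application layer: what a tool developer calls. It never touches storage directly.
    static final class WordNetService {
        private final SynsetStore store;

        WordNetService(SynsetStore store) {
            this.store = store;
        }

        int createSynset(String concept, String lexicalClass, String... words) {
            int id = store.insert(concept, lexicalClass);
            Collections.addAll(store.find(id).words, words);
            return id;
        }

        // Returns a read-only copy so callers cannot modify stored data by accident.
        List<String> wordsOf(int id) {
            return Collections.unmodifiableList(new ArrayList<>(store.find(id).words));
        }

        int synsetCount() {
            return store.count();
        }
    }

    public static void main(String[] args) {
        WordNetService wordNet = new WordNetService(new InMemorySynsetStore());
        int id = wordNet.createSynset("a large natural stream of water", "noun", "river", "stream");
        System.out.println(wordNet.wordsOf(id) + " / synsets: " + wordNet.synsetCount());
    }
}
```

Replacing `InMemorySynsetStore` with another implementation of `SynsetStore` would leave `WordNetService` and its callers unchanged, which is the property the two-layered architecture is meant to provide.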
The Application layer does not directly access the data stored on disk but uses the Data layer for this purpose. The Data layer provides this service through a set of data classes and methods that it exposes to the Application layer. The Data layer understands the design and storage technology used to store the data, e.g. relational database, flat text files, indexed files or XML. It is responsible for the actual access and manipulation of data stored in files or a database, and is expected to reorganize the data in memory so that it can be exposed to the Application layer through Data objects. This protects the Application layer from changes in storage technology or storage design. In the current implementation the Data layer accesses the data from a relational database. In future, a change in the storage method would only require a new Data layer to be implemented for that specific storage technology. Only the Application layer classes are exposed to the developer; the Data layer classes are hidden, so such changes will not affect the tools and software already created by the developer. The implementation keeps the classes and objects belonging to the two layers in separate libraries, which makes it easy to replace the Data layer when desired.

Figure 1: A simplified Class diagram of IndoWordNet API (Application Layer).

Some of the important classes of the Application Layer are as follows:

1. IWAPI: A static class that initialises the IndoWordNet API library for use. The first step in using the IndoWordNet API is to authenticate the user, i.e.
   IWAPI.init("username", "password");
   This class manages connectivity to the language-specific databases. After establishing a single connection with the master database, you can connect to multiple language-specific databases using
   IWAPI.getLanguageObject(IWLanguageConstants.KONKANI);
   where IWLanguageConstants is a static class that contains the constant names for all language-specific databases.
   It also allows the developer to maintain meta information:
   • To get/add/delete various lexical and semantic relations.
   • To get/add/delete various property values of lexical relations (such as action, amount, color, direction, etc.) and of semantic relations.
   • To manipulate domain names, grammatical categories, ontology nodes, ontology hierarchy, etc.

2. IWLanguage: A class that provides a connection to a language WordNet, obtained using
   IWLanguage langObj = IWAPI.getLanguageObject(IWLanguageConstants.KONKANI);
   Using langObj, i.e. the object of IWLanguage, the developer can:
   • Get the total number of synsets and the total number of words of a given language.
   • Get all synsets/words belonging to a domain, a category or a range of Ids.
   • Create/destroy a synset, word, domain, category, ontology, relation, etc.
   • Get words and synsets having a given semantic or lexical relation.

3. IWSynset: A class that represents a synset. The synset object of the IWSynset class allows the developer:
   • To get the concept, translated concept and transliterated concept of a given synset present in the database, and to get the usage examples of the synset.
   • To get/set the category, domain, source and concept of a given synset.
   • To add/remove examples, files and various semantic/lexical relations of a synset.

4. IWSynsetCollection: A class that represents a collection of synsets. It allows the developer:
   • To get the size of the collection, i.e. the number of synsets present in the collection, using the method count();
   • To iterate through the collection, using getElement(); first(); next(); previous(); last();

5. IWWord: A class that represents a word. The word object of the IWWord class allows the developer:
   • To get the id of the word and to get the synsets for a given word.
   • To get various lexical relations such as antonymy relation, compounding relation, gradation relation, etc.
   • To add/remove various lexical relations for a specified synset and word, such as antonymy relation, compounding relation, gradation relation, etc.

6. IWWordCollection: A class that represents a collection of words for a synset. It allows the developer:
   • To get the size of the collection, i.e. the number of words present in the collection.
   • To iterate through the collection, using getElement(); first(); next(); previous(); last();
   • To delete a word, to insert a word at a particular location, to move a word to a particular location, to change the priority of the words, etc.

7. IWExampleCollection: A class that represents a collection of examples for a synset. It allows the developer:
   • To get the size of the collection, i.e. the number of examples present in the collection.
   • To iterate through the collection, using getElement(); first(); next(); previous(); last();
   • To move the current example to a specified location, to insert a new element at a specified location in the collection, to insert a new element at the last position in the collection, etc.

8. IWFile: A class that represents files. Using an object of the IWFile class, the developer can:
   • Get/set the file content, file Id, file size, file type, etc.

9. IWOntology: A class that represents an ontology node. Each synset is mapped to an ontology node in the ontology tree. Using an object of the IWOntology class, the developer can:
   • Get/set the ontology Id, ontology data, ontology translated data and ontology transliterated data.
   • Get/set the ontology description, ontology translated description and ontology transliterated description from the database.

10. IWOntologyCollection: A collection of child nodes for a given ontology node. Using an object of this class, the developer can:
   • Get the size of the collection, i.e. the number of ontology nodes present in the collection.
   • To iterate through the collection, using getElement(); first(); next(); previous(); last();
   Similarly, there are classes such as IWAntonymyCollection, IWGradationCollection, IWMeroHoloCollection, IWNounVerbLinkCollection, etc.

11. IWException: A class that defines all the exceptions raised in case of error or failure.

Note: There are additional classes such as IWAntonymy, IWGradation and IWMeronymyHolonymy, which are private classes used internally in the API and are hidden from the developer.

The Data layer will change depending on the storage technology, but the Application layer will remain unchanged. The Data layer encapsulates the storage design and provides a standard interface to the Application layer. It supports all the operations that need to be performed on the data. The Data layer consists of the following important classes:

1. IWDb: A class that represents a database/file store.
2. IWCon: A class that represents an authenticated connection to a database/file store.
3. IWStatement: A class that contains all the data manipulation functionality required by the Application layer.
4. IWResult: A class that returns results to the Application layer, i.e. synsets, collections, etc.

4 Implementation

The reference implementation of the IndoWordNet API is done in Java and PHP. The size of the Java jar file is 224 KB and the size of the PHP API package is 452 KB. The IndoWordNet API classes are stored in 4 packages:
• unigoa.indowordnet.api: This package consists of the important classes of the Application Layer such as IWAPI, IWLanguage, IWSynset, etc., as discussed in section 3.
• unigoa.indowordnet.constants: This package consists of static classes that define all the constants used by the API and do not contain any methods.
• unigoa.indowordnet.maintenance: This package consists of a class that allows the developer to maintain master data.
• unigoa.indowordnet.storage: This package consists of the classes of the Data Layer such as IWDb, IWCon, IWStatement and IWResult. They provide a standard interface to the Application layer, and their implementation will change depending on the storage technology.

Figure 2: Example to create a new synset using IndoWordNet API.

Figure 2 illustrates the use of the IndoWordNet API. The basic flow is as follows:
1. Initialize the API library and authenticate the user (line no. 10).
2. Connect to a language WordNet (Konkani WordNet, line 11-12). The same method can be used repeatedly to create connections to multiple language WordNets. Each call returns an object representing a connection to one language WordNet, which facilitates simultaneous access to multiple WordNets.
3. Create a new synset using the language object (line no. 14).
4. Add a word or an example to the new synset (line no. 19-20).
Similarly, to get the information of a synset with a given synset Id, first create the synset object and then use it to get the synset information (line no. 22-39).

5 Conclusion

The IndoWordNet API lets a developer build tools and other WordNet-related software resources, with enhanced features, for any language quickly and with minimal effort. The ease and speed with which new tools to support the research community can be developed was demonstrated during the implementation of the WordNet Content Management System (CMS), which used the IndoWordNet API implemented in PHP. A demo paper on the WordNet Content Management System has been accepted at COLING 2012. The Konkani WordNet website is deployed using WordNet CMS. The IndoWordNet API has been successfully used by other IndoWordNet members to develop web-based tools such as a bilingual dictionary, a multilingual dictionary and a Lexical Relation Tool to capture various lexical relations such as antonymy, gradation, etc.
A Synset Management System is under development to assist the creation of language-specific synsets and to manage their linkages to other Indian language WordNets. We expect that in future the IndoWordNet APIs will be used for the development of tools and software resources by IndoWordNet members.

References

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1993. Introduction to WordNet: An On-line Lexical Database.
George A. Miller. 1995. WordNet: A Lexical Database for English.
Joshua Bloch. 2007. How to Design a Good API and Why it Matters.
Pushpak Bhattacharyya. 2010. IndoWordNet. Lexical Resources Engineering Conference 2010 (LREC 2010), Malta, May 2010.
Pushpak Bhattacharyya, Christiane Fellbaum, and Piek Vossen. 2010. Principles, Construction and Application of Multilingual WordNets. Proceedings of the 5th Global WordNet Conference (Mumbai, India), 2010.