Papers by Martin Rajman

The size of digital libraries is increasing, making navigation and access to information more challenging. Improving a system by observing its users' activities can help provide better services to the users of very large digital libraries. In this paper we explain how the Invenio open-source software, used by the CERN Document Server (CDS), allows fine-grained logging of user behavior. In a first phase, the sequence of actions performed by users of CDS is captured; in a second phase, statistical data is computed offline. This paper explains these two steps and their results. Although the analyzed system focuses on the high-energy physics literature, the process could be applicable to other scientific communities with a large, international user base.
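As a rough illustration of the two-phase design described in this abstract, the following sketch logs one record per user action and computes a simple usage statistic offline. The record layout and function names are illustrative assumptions, not Invenio's actual logging API.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Phase 1 (online): append one timestamped record per user action.
# Field names are hypothetical, chosen only for this example.
def log_action(logfile, session_id, action, **details):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "session": session_id,
        "action": action,          # e.g. "search", "download", "view_record"
        "details": details,
    }
    logfile.write(json.dumps(record) + "\n")

# Phase 2 (offline): a simple statistic, the frequency of each action type.
def action_counts(path):
    with open(path) as f:
        return Counter(json.loads(line)["action"] for line in f)
```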
Applications of Natural Language to Data Bases, 2003
This paper describes an integrated system that enables the storage and retrieval of meeting transcripts (e.g. staff meetings). The system gives users who have not attended a meeting, or who want to review a particular point, enhanced access to an annotated version of the recorded data. This paper describes the various stages in the processing, storage and querying of the data. First, we put forward the idea of shallow dialogue processing, in order to extract significant features of the meeting transcripts for storage in a database, whose structure is briefly outlined. Low-level access to the database is provided as a Web service, which can be connected to several interfaces. A description of how multimodal input can be used with VoiceXML is also provided, thus offering an easy solution for voice- and web-based access to the dialogue data. The paper ends with considerations about the available data and its use in the current version of the system.
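The paper only outlines the database structure, so the sketch below is a guess at the kind of annotated segment record and low-level lookup the Web service could expose; every field name here is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DialogueSegment:
    # Hypothetical record for one annotated stretch of a meeting transcript.
    meeting_id: str
    speaker: str
    start_s: float   # segment start time, in seconds
    end_s: float     # segment end time, in seconds
    text: str        # transcribed utterance
    topic: str       # shallow-dialogue annotation, e.g. the discussed topic

def segments_about(segments, topic):
    # Low-level query of the kind an interface could send to the Web service.
    return [s for s in segments if s.topic == topic]
```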
This paper reports on the main issues that arose during the development and testing of a coding scheme for the argumentative annotation of meeting discussions. A corpus of meeting discussions has been collected in the framework of a research project on multimodal dialogue analysis, and a coding scheme has been proposed. Annotations have been gathered from a set of annotators with different skills in argumentative discourse analysis, and the reliability of the coding scheme has been assessed against standard statistical measures.
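The abstract does not name the statistical measures used; Cohen's kappa is a standard choice for chance-corrected agreement between two annotators, so a minimal sketch of it is given here, with invented example labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators over the same items.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Invented argumentative labels from two annotators:
a = ["accept", "reject", "propose", "accept", "propose"]
b = ["accept", "reject", "accept", "accept", "propose"]
print(round(cohens_kappa(a, b), 3))  # 0.688
```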
People are increasingly using provider services through the Internet. While a web site provides information about the contract terms and conditions that clients have to assent to in order to use its services, web services offer no comparable way of taking legal issues into account. There are some attempts to build machine-readable eContract languages that can express the contractual terms between the participants, but they are mainly designed to govern the distribution and use of electronic content. We propose an architecture for the definition of, and assent to, eContracts for Web Services.
Joint Conference on Knowledge-Based Software Engineering, Jun 29, 2008
Document ranking for scientific publications involves a variety of specialized resources (e.g. author or citation indexes) that are usually difficult to use within standard general-purpose search engines: such engines typically operate on large-scale heterogeneous document collections for which the required specialized resources are not available for all the documents present in the collections. Integrating such resources into specialized information retrieval engines is therefore important to cope with community-specific user expectations, which strongly influence the perception of relevance within the considered community. In this perspective, this paper extends the notion of ranking with various methods exploiting different types of bibliographic knowledge, a crucial resource for measuring the relevance of scientific publications. In our work, we experimentally evaluated the adequacy of two such ranking methods (one based on freshness, i.e. the publication date, and the other on a novel index, the download-Hirsch index, based on download frequencies) for information retrieval from the CERN scientific publication database in the domain of particle physics. Our experiments show that (i) the considered specialized ranking methods indeed represent promising candidates for extending the baseline ranking (relying on the download frequency), as they both lead to fairly small search-result overlaps; and (ii) extending the baseline ranking with the specialized ranking method based on freshness significantly improves the quality of the retrieval: a 16.2% relative increase for the Mean Reciprocal Rank (resp. a 5.1% relative increase for the Success@10, i.e. the estimated probability of finding at least one relevant document among the top ten retrieved) when a local rank sum is used for aggregation. We plan to further validate these results by carrying out additional experiments with the specialized ranking method based on the download-Hirsch index, to further improve the performance of our aggregative approach.
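To make the two ingredients concrete, the sketch below computes a download-Hirsch index, read here by analogy with the Hirsch index as the largest h such that h documents have at least h downloads each, and merges two rankings by a local rank sum. Both readings are assumptions based on this abstract, not the paper's exact definitions.

```python
def download_hirsch(download_counts):
    # Largest h such that at least h documents have >= h downloads each
    # (assumed analogue of the citation-based Hirsch index).
    counts = sorted(download_counts, reverse=True)
    h = 0
    for rank, downloads in enumerate(counts, start=1):
        if downloads >= rank:
            h = rank
    return h

def rank_sum_merge(ranking_a, ranking_b):
    # Aggregate two rankings of (roughly) the same documents by summing
    # their positions; documents missing from one list sink to the bottom.
    pos_a = {doc: r for r, doc in enumerate(ranking_a)}
    pos_b = {doc: r for r, doc in enumerate(ranking_b)}
    docs = pos_a.keys() | pos_b.keys()
    worst = len(docs)
    return sorted(docs, key=lambda d: pos_a.get(d, worst) + pos_b.get(d, worst))
```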
This paper presents a text classification procedure that has been developed in the context of an information extraction project. In the prototype developed for this project, newspaper advertisements are processed by three main modules: first, a classification module associates a category with the advertisement; then, a tagging module identifies textual information units that are related to the associated category; finally, a predefined form for that category is filled with the tagged text. The classification module, which is the main focus of this paper, consists of using a naive Bayes classifier and, at the same time, trying to fill all the predefined forms associated with all categories. The results of both methods (classification probabilities and filling scores) are then combined to produce a final classification decision. This mixed classification method is described and evaluated on the basis of concrete experiments carried out on real data. The purpose of the presented experiments is to precisely evaluate the impact of the information extraction step on classification accuracy. As one could reasonably expect, classification relying on information extraction alone does not perform very well, but when used as a complement to the statistical approach it significantly improves the classification results.
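The abstract does not give the exact combination rule, so the following sketch assumes a simple convex combination of the naive Bayes posterior and the form-filling score for each category; alpha is a hypothetical weighting parameter.

```python
def classify(nb_probs, fill_scores, alpha=0.5):
    # nb_probs:    {category: P(category | ad text)} from a naive Bayes classifier
    # fill_scores: {category: fraction of that category's form slots filled}
    # alpha:       assumed weight between the two evidence sources
    return max(nb_probs, key=lambda c:
               alpha * nb_probs[c] + (1 - alpha) * fill_scores.get(c, 0.0))

# Example: the filling score tips the decision toward "car_sale".
print(classify({"car_sale": 0.48, "real_estate": 0.52},
               {"car_sale": 0.9, "real_estate": 0.2}))
```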
This paper compares two techniques for robust parsing of extragrammatical natural language. Both are based on well-known approaches; one selects the optimal combination of partial analyses, the other relaxes grammar rules. Both techniques use a stochastic parser to select the "best" solution among multiple analyses. Experimental results show that, regardless of the grammar, the best results are obtained by sequentially combining the two techniques: first relaxing the rules and, only when that fails, selecting a combination of partial analyses.
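The sequential combination described above fits in a few lines; the parser interface (callables returning a full analysis or None) is an assumption made for illustration.

```python
def robust_parse(sentence, parse, parse_relaxed, combine_partials):
    # parse / parse_relaxed return a complete analysis or None;
    # combine_partials always returns the best combination of partial analyses.
    analysis = parse(sentence)
    if analysis is None:
        analysis = parse_relaxed(sentence)    # step 1: relax the grammar rules
    if analysis is None:
        analysis = combine_partials(sentence)  # step 2: fall back to partials
    return analysis
```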
Recent Advances in Natural Language Processing, 2001
Finding the most probable parse tree in the framework of Data-Oriented Parsing (DOP), a Stochastic Tree Substitution Parsing scheme developed by R. Bod (Bod 92), has proven to be NP-hard in the most general case (Sima'an 96a). However, introducing some a priori restrictions on the choice of the elementary trees (i.e. grammar rules) leads to interesting DOP instances with polynomial time complexity. The purpose of this paper is to present such an instance, based on the minimal-maximal selection principle, and to evaluate its performance on two different corpora.
In this paper we report on an experiment with an automated metric for analyzing the grammaticality of machine translation output. The approach (Rajman, Hartley, 2001) is based on the distribution of the linguistic information within a translated text, which is assumed to be similar between a learning corpus and the translation. This method is quite inexpensive, since it does not need any reference translation. First we describe the experimental method and the different tests we used. Then we show the promising results we obtained on the CESTA data, and how well they correlate with human judgments.
Jean-Cedric Chappelier and Martin Rajman. Parsing DOP with Monte-Carlo Techniques. 2001.
This paper gives an overview of the assessment and evaluation methods which have been used to determine the quality of the INSPIRE smart home system. The system allows different home appliances to be controlled via speech, and consists of speech and speaker recognition, speech understanding, dialogue management, and speech output components. The performance of these components is first assessed individually, and then the entire system is evaluated in an interaction experiment with test users. Initial results of the assessment and evaluation are given, in particular with respect to the impact of the transmission channel on speech and speaker recognition, and the assessment of speech output for different system metaphors.
In earlier work, we succeeded in automatically predicting the relative rankings of MT systems derived from human judgments on the Fluency, Adequacy or Informativeness of their output. In this paper, we present an experiment, using human evaluators and additional data, designed to test the robustness of our earlier results. These had yielded two promising automatically computable predictors: the D-score, based on semantic features of the MT output, and the X-score, based on syntactic features. We conclude that the X-score is indeed a robust and reliable predictor, even on new data for which it has not been specifically tuned.
In this paper, we report on the results of a full-size evaluation campaign of various MT systems. This campaign is novel compared to the classical DARPA/NIST MT evaluation campaigns in the sense that French is the target language, and that it includes a meta-evaluation experiment of various metrics claiming to better predict different attributes of translation quality. We first describe the campaign, its context, its protocol and the data we used. Then we summarise the results obtained by the participating systems and discuss the meta-evaluation of the metrics used.
The goal of this contribution is to present how the notion of dialogue management evaluation was integrated into the rapid dialogue prototyping methodology (RDPM), designed and experimented with by the authors in the framework of the InfoVox project. We first describe the proposed RDPM. The general idea of this methodology is to produce, for any given application, a quickly deployable dialogue-driven interface and to improve this interface through an iterative process based on Wizard-of-Oz experiments (i.e. dialogue simulations) ...