Papers by Ricardo Baeza-Yates
Computer, 2000
Sponsored search advertising has dramatically impacted search engines, consumers, and organizations, and will continue to do so in the foreseeable future. Published by the IEEE Computer Society.
Proceedings. String Processing and Information Retrieval: A South American Symposium (Cat. No.98EX207), 1998
We present a new model to query document databases by content and structure. The main merits of the model are: it allows rich structure in the documents; the query algebra is intuitive (moreover, complemented by a visual query language) and powerful; it is efficiently implementable; it can be built on top of a traditional indexing system or even with no index at all; it is strongly oriented to user-definable relevance ranking instead of boolean logic; and it allows flexible visualization of results in terms of structure, contents and highlighting of user-defined important parts in the query.
World Wide Web Conference Series, 2003
String Processing and Information Retrieval, 2001
Computing Research Repository, 2010
Vertical search engines focus on specific slices of content, such as the Web of a single country or the document collection of a large corporation. Despite this, like general open web search engines, they are expensive to maintain, expensive to operate, and hard to design. Because of this, predicting the response time of a vertical search engine is usually done ...
Proceedings SCCC'98. 18th International Conference of the Chilean Society of Computer Science (Cat. No.98EX212), 1998
In this paper we present a model for visualizing large collections of documents in World Wide Web retrieval, independently of the retrieval system. Our proposal eases the use of visualization tools, which partially solve the problem of data overload on the Internet. We present a specific software architecture to separate the user interface from the retrieval ...
Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000, 2000
We analyze the dependency problem of the user interface with the information retrieval software. Our approach allows the separation of the user interface from the retrieval component. This is useful when the user wants to select an interface or visualization metaphor that may not always be available for different information retrieval systems. We present a model for visualizing large collections ...
Proceedings of the IEEE/LEOS 3rd International Conference on Numerical Simulation of Semiconductor Optoelectronic Devices (IEEE Cat. No.03EX726), 2003
We present an approach for building text visualizations that avoids using plug-ins or clients based on languages like Java. Instead we propose to make the search engine application more aware of the visualization process and use the web browser standard features to do the rendering work. We demonstrate the ideas with a text visualization metaphor implementation that is part of ...
Journal of Discrete Algorithms, 2000
We focus on how to compute the edit distance (or similarity) between two images and on the problem of approximate string matching in two dimensions, that is, finding a pattern of a given size in a text of a given size with at most a given number of errors (character substitutions, insertions, and deletions). Pattern and text are matrices over a finite alphabet. We present ...
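For reference, the error model named in this abstract (substitutions, insertions, and deletions) can be illustrated in one dimension. The sketch below is the textbook dynamic-programming formulation for plain strings, not the two-dimensional method of the paper; the function names and the error threshold `k` are illustrative assumptions.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """1-based end positions in text where pattern occurs with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, ct in enumerate(text, start=1):
        new = [0]  # row 0 stays 0: an occurrence may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ct else 1
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + cost))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

The only difference between the two functions is the initialization of row 0: keeping it at zero lets a match begin anywhere in the text, which turns a distance computation into a search.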
Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '95, 1995
Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication and its date appear, and notice ...
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '07, 2007
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the ...

Proceedings of the 21st international conference companion on World Wide Web - WWW '12 Companion, 2012
In the online world, user engagement refers to the phenomena associated with being captivated by a web application and wanting to use it longer and more frequently. Nowadays, many providers operate multiple content sites, very different from each other. Due to their extremely varied content, these are usually studied and optimized separately. However, user engagement should be examined not only within individual sites, but also across sites, that is, across the entire content provider network. In previous work, we investigated networked user engagement by defining a global measure of engagement that captures the effect that sites have on the engagement on other sites within the same browsing session. Here, we look at the effect of links on networked user engagement, as these are commonly used by online content providers to increase user engagement.

Proceedings of the 2nd International ICST Conference on Scalable Information Systems, 2007
To address the rapid growth of the Internet, modern Web search engines have to adopt distributed organizations, where the collection of indexed documents is partitioned among several servers, and query answering is performed as a parallel and distributed task. Collection selection can be a way to reduce the overall computing load, by finding a trade-off between the quality of results retrieved and the cost of solving queries. In this paper, we analyze the relationship between the collection selection strategy, the effect on load balancing and on the caching subsystem, by exploring the design-space of a distributed search engine based on collection selection. In particular, we propose a strategy to perform collection selection in a load-driven way, and a novel caching policy able to incrementally refine the effectiveness of the results returned for each subsequent cache hit. The combination of load-driven collection selection and incremental caching strategies allows our system to retrieve two thirds of the top-ranked results returned by a baseline centralized index, with only one fifth of the computing workload.

Lecture Notes in Computer Science, 2007
This paper studies the impact of the tail of the query distribution on caches of Web search engines, and proposes a technique for achieving higher hit ratios compared to traditional heuristics such as LRU. The main problem we solve is identifying infrequent queries, which cause a reduction in hit ratio because caching them often does not lead to hits. To mitigate this problem, we introduce a cache management policy that employs an admission policy to prevent infrequent queries from taking the space of more frequent queries in the cache. The admission policy uses either stateless features, which depend only on the query, or stateful features based on usage information. The proposed management policy is more general than existing policies for caching of search engine results, and it is fully dynamic. The evaluation results on two different query logs show that our policy achieves higher hit ratios when compared to previously proposed cache management policies.
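The idea of guarding a cache with an admission policy can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the policy evaluated in the paper: the class name `AdmissionLRUCache`, the `min_seen` threshold, and the choice of past query frequency as the single stateful feature are all hypothetical.

```python
from collections import Counter, OrderedDict


class AdmissionLRUCache:
    """LRU result cache that admits a query only after it has been seen often enough."""

    def __init__(self, capacity, min_seen=2):
        self.capacity = capacity
        self.min_seen = min_seen    # hypothetical stateful feature: past frequency
        self.seen = Counter()       # usage information kept outside the cache
        self.cache = OrderedDict()  # query -> result, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, query, compute):
        """Return the result for query, calling compute(query) on a miss."""
        self.seen[query] += 1
        if query in self.cache:
            self.cache.move_to_end(query)  # refresh recency
            self.hits += 1
            return self.cache[query]
        self.misses += 1
        result = compute(query)
        # Admission check: queries seen too rarely never displace cached entries.
        if self.seen[query] >= self.min_seen:
            self.cache[query] = result
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used entry
        return result


if __name__ == "__main__":
    stream = ["a", "a", "a", "x", "y", "a", "p", "q", "a"]
    guarded = AdmissionLRUCache(capacity=2, min_seen=2)
    for q in stream:
        guarded.lookup(q, str.upper)
    print(guarded.hits, guarded.misses)
```

Setting `min_seen=1` admits every query and reduces the class to plain LRU, which makes it easy to compare both behaviours on the same query stream. The trade-off is visible in the design: the admission check costs one extra miss for each query that later turns out to be frequent, in exchange for protecting cached entries from one-off queries.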
Proceeding of the 17th international conference on World Wide Web - WWW '08, 2008

Lecture Notes in Computer Science, 2008
In this paper we study privacy preservation for the publication of search engine query logs. In particular, we introduce a new privacy concern, which is that of website privacy (or business privacy). We define the possible adversaries that could be interested in disclosing website information and the vulnerabilities found in the query log, from which they could benefit. In this work we also detail anonymization techniques to protect website information, and explore the different types of attacks that an adversary could use. We then present a graph-based heuristic to validate the effectiveness of our anonymization method, and perform an experimental evaluation of this approach. Our experimental results show that the query log can be appropriately anonymized against a specific attack for website exposure, by only removing approximately 9% of the total volume of queries and clicked URLs.
... Yates and G. Navarro, Faster approximate string matching. Algorithmica 23 (2) (1999), pp. 127-158. Preliminary version in: Proc. CPM'96. [3] R. Baeza-Yates and C. Perleberg, Fast and practical approximate pattern matching. ...