Earlier work suggests that mixture-distance can improve the performance of feature-based face recognition systems in which only a single training example is available for each individual. In this work we investigate the non-feature-based Eigenfaces technique of Turk and Pentland, replacing Euclidean distance with mixture-distance. In mixture-distance, a novel distance function is constructed based on local second-order statistics as estimated by modeling the training data with a mixture of normal densities. The approach is described and experimental results on a database of 600 people are presented, showing that mixture-distance can reduce the error rate by up to 73.9%. In the experimental setting considered, the results indicate that the simplest form of mixture-distance yields considerable improvement. Additional, but less dramatic, improvement was possible with more complex forms. The results show that even in the absence of multiple training examples for each class, it is someti...
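The core idea can be sketched as follows. This is a much-simplified illustration, not the paper's method: where the paper fits a mixture of normal densities by EM and combines component distances, the sketch below stands in cluster labels for the EM step and measures a probe against the single component that best explains the reference point (a Mahalanobis distance under local second-order statistics). All names here are hypothetical.

```python
import numpy as np

def fit_components(X, labels):
    # Stand-in for EM: estimate one Gaussian (mean, inverse covariance)
    # per given cluster label.
    comps = []
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu = Xk.mean(axis=0)
        cov = np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        comps.append((mu, np.linalg.inv(cov)))
    return comps

def mixture_distance(x, y, comps):
    # Pick the component whose Mahalanobis distance to y is smallest,
    # then measure x against y under that component's local metric.
    def maha2(i, v):
        mu, icov = comps[i]
        d = v - mu
        return d @ icov @ d
    best = min(range(len(comps)), key=lambda i: maha2(i, y))
    _, icov = comps[best]
    d = x - y
    return float(np.sqrt(d @ icov @ d))
```

The point of the construction is that "distance" is no longer globally Euclidean: it is shaped by the second-order statistics of whichever region of the training data the reference point falls in.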
Inductive learning methods, such as neural networks and decision trees, have become a popular approach to developing DNA sequence identification tools. Such methods attempt to form models of a collection of training data that can be used to predict future data accurately. The common approach to using such methods on DNA sequence identification problems forms models that depend on the absolute locations of nucleotides and assume independence of consecutive nucleotide locations. This paper describes a new class of learning methods, called compression-based induction (CBI), that is geared towards sequence learning problems such as those that arise when learning DNA sequences. The central idea is to use text compression techniques on DNA sequences as the means for generalizing from sample sequences. The resulting methods form models that are based on the more important relative locations of nucleotides and on the dependence of consecutive locations. They also provide a suitable framewor...
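The central idea, compression as generalization, can be illustrated with a crude stand-in: classify a new sequence by asking which class's training corpus lets a general-purpose compressor encode it most cheaply. This is only a sketch of the flavor of compression-based induction, using off-the-shelf `zlib` rather than the paper's techniques; the names are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def cbi_classify(seq: str, corpora: dict) -> str:
    # Approximate the class-conditional coding cost of seq as the extra
    # bytes needed to compress it after the class corpus. A sequence that
    # shares structure with the corpus compresses almost for free.
    def cost(corpus: str) -> int:
        c = corpus.encode()
        return compressed_size(c + seq.encode()) - compressed_size(c)
    return min(corpora, key=lambda k: cost(corpora[k]))
```

Because the compressor exploits repeated substrings wherever they occur, the model is sensitive to relative location and to dependence between consecutive symbols, exactly the properties the abstract contrasts with absolute-position, independence-assuming models.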
This paper introduces a method of implementing secure cryptosystems which use short secret keys, as short as 40 bits or less. The use of short keys has the advantage that keys can be readily memorized and do not need to be written down or stored electronically. Nonetheless, our short key cryptographic protocols provide security comparable to the security provided by conventional cryptosystems with longer keys.
We consider the computational problem of finding nearest neighbors in general metric spaces. Of particular interest are spaces that may not be conveniently embedded or approximated in Euclidian space, or where the dimensionality of a Euclidian representation is very high. Also relevant are high-dimensional Euclidian settings in which the distribution of data is in some sense of lower dimension and embedded in the space. The vp-tree (vantage point tree) is introduced in several forms, together with associated algorithms, as an improved method for these difficult search problems. Tree construction executes in O(n log(n)) time, and search is under certain circumstances and in the limit, O(log(n)) expected time. The theoretical basis for this approach is developed and the results of several experiments are reported. In Euclidian cases, kd-tree performance is compared.
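A minimal vp-tree sketch follows, assuming only a metric d. It shows the two ideas the abstract names, splitting at the median distance to a vantage point during construction, and pruning with the triangle inequality during search, but omits the paper's refinements (vantage point selection, the tree's several forms), and picks the first point as vantage point for simplicity.

```python
def build_vp(points, d):
    # Take a vantage point, split the remaining points at the median
    # distance to it, and recurse on the inner and outer halves.
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return (vp, 0.0, None, None)
    mu = sorted(d(vp, p) for p in rest)[len(rest) // 2]   # median radius
    inner = [p for p in rest if d(vp, p) < mu]
    outer = [p for p in rest if d(vp, p) >= mu]
    return (vp, mu, build_vp(inner, d), build_vp(outer, d))

def nearest(node, q, d, best=None):
    # Branch-and-bound: descend the likelier side first; the other side
    # can be skipped when the triangle inequality proves every point in
    # it lies at least |d(q, vp) - mu| away.
    if node is None:
        return best
    vp, mu, inner, outer = node
    dq = d(vp, q)
    if best is None or dq < best[0]:
        best = (dq, vp)
    near, far = (inner, outer) if dq < mu else (outer, inner)
    best = nearest(near, q, d, best)
    if best[0] >= abs(dq - mu):
        best = nearest(far, q, d, best)
    return best
```

Nothing here requires coordinates; only d is used, which is what makes the structure applicable to general metric spaces where kd-trees are not.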
Pex is a preprocessor and a build management system for Pyrex, or Cython. Among other things, Pex adds the ability to conveniently write C-fast numerics using numpy.ndarray, frees you from Makefiles and header files, and makes your Pyrex classes serializable, through both pickling and a faster scheme. To the user, Pex looks like a programming language that is much like Python, but with additional syntax, through which it can be made to run as fast as C in settings all the way from small numerical loops to large scale systems. Pyrex is a Python to C compiler with the added functionality of C-fast function calls, an object system with C-fast attribute access and method invocation, and extra syntax that allows mixing-in of C code. Cython is a close cousin to Pyrex, which adds many convenient features. Most of the functionality of the language is in Pyrex, and its author Greg Ewing deserves the majority of the credit. Code snippets are given like so:

    # lives in file "main.px"
    print "hello world"

and unless otherwise specified are assumed to live in the file main.px (.px means a Pex file).
In this paper we demonstrate that two common metrics, symmetric set difference and Euclidian distance, have normalized forms which are nevertheless metrics. The first of these, |A△B|/|A∪B|, is easily established and generalizes to measure spaces. The second applies to vectors in R^n and is given by ‖X−Y‖/(‖X‖+‖Y‖). That this is a metric is more difficult to demonstrate, and is true for Euclidian distance (the L2 norm) but for no other integral Minkowski metric. In addition to providing bounded distances when no a priori data bound exists, these forms are qualitatively different from their unnormalized counterparts, and are therefore also distinguished from simpler range-companded constructions. Mixed forms are also defined which combine absolute and relative behavior, while remaining metrics. The result is a family of forms which resemble commonly used dissimilarity statistics but obey the triangle inequality.
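The second normalized form is easy to experiment with numerically. The sketch below computes ‖X−Y‖/(‖X‖+‖Y‖) and spot-checks the triangle inequality on random vectors; an empirical sanity check of the claimed metric property, of course, not a substitute for the paper's proof.

```python
import math
import random

def norm_dist(x, y):
    # ||x - y|| / (||x|| + ||y||): bounded in [0, 1], zero only when x == y.
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    denom = math.sqrt(sum(a * a for a in x)) + math.sqrt(sum(b * b for b in y))
    return diff / denom if denom else 0.0

random.seed(0)
vecs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
for x in vecs:
    for y in vecs:
        for z in vecs:
            # Triangle inequality holds for the L2-based normalized form.
            assert norm_dist(x, z) <= norm_dist(x, y) + norm_dist(y, z) + 1e-12
```

Replacing the L2 norm with, say, L1 in the same construction breaks the triangle inequality on some triples, which is the abstract's point that the result holds for no other integral Minkowski metric.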
The library of practical abstractions (LIBPA) provides efficient implementations of conceptually simple abstractions, in the C programming language. We believe that the best library code is conceptually simple so that it will be easily understood by the application programmer; parameterized by type so that it enjoys wide applicability; and at least as efficient as a straightforward special-purpose implementation. You will find that our software satisfies the highest standards of software design, implementation, testing, and benchmarking. The current LIBPA release is a source code distribution only. It consists of modules for portable memory management, one dimensional arrays of arbitrary types, compact symbol tables, hash tables for arbitrary types, a trie module for length-delimited strings over arbitrary alphabets, single precision floating point numbers with extended exponents, and logarithmic representations of probability values using either fixed or floating point numbers. We ...
Approximate string comparison and search is an important part of applications that range from natural language to the interpretation of DNA. This paper presents a bipartite weighted graph matching approach to these problems, based on the authors’ linear time matching algorithms‡. Our approach’s tolerance to permutation of symbols or blocks distinguishes it from the widely used edit distance and finite state machine methods. A close relationship with the earlier related ‘proximity comparison’ method is established. Under the linear cost model, a simple O(1) time per position online algorithm is presented for comparing two strings given a fixed alignment. Heuristics are given for optimal alignment. In the approximate string search problem, one string advances in a fixed direction relative to the other with each time step. We introduce a new online algorithm for this setting which dynamically maintains an optimal bipartite weighted matching. We discuss the application of our algorithm...
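The permutation tolerance that distinguishes matching-based comparison from edit distance can be illustrated with a deliberately crude stand-in: greedily match each symbol to an unused equal symbol within a small positional window. This is not the authors' linear-time optimal matching algorithm, only a toy showing why matching positions (rather than aligning them rigidly) forgives local transpositions; the window size and names are assumptions.

```python
def match_similarity(s: str, t: str, window: int = 2) -> float:
    # Greedily match each symbol of s to an unused equal symbol of t
    # within +/- window positions. Locally permuted symbols still match,
    # unlike strict positionwise comparison.
    used = [False] * len(t)
    matched = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and t[j] == c:
                used[j] = True
                matched += 1
                break
    return matched / max(len(s), len(t), 1)
```

A full bipartite weighted matching would replace the greedy inner loop with an optimal assignment of symbols to positions, which is what the paper's online algorithm maintains dynamically as one string slides past the other.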
Natural learners rarely have access to perfectly labeled data, motivating the study of unsupervised learning in an attempt to assign labels. An alternative viewpoint, which avoids the issue of labels entirely, has as the learner's goal the discovery of an effective metric with which similarity judgments can be made. We refer to this paradigm as metric learning. Effective classification, for example, then becomes a consequence rather than the direct purpose of learning. Consider the following setting: a database made up of exactly one observation of each of many different objects. This paper shows that, under admittedly strong assumptions, there exists a natural prescription for metric learning in this data-starved case. Our outlook is stochastic, and the metric we learn is represented by a joint probability density estimated from the observed data. We derive a closed-form expression for the value of this density starting from an explanation of the data as a Gaussian Mixture. Our framework places two known classification techniques of statistical pattern recognition at opposite ends of a spectrum, and describes new intermediate possibilities. The notion of a stochastic equivalence predicate is introduced and striking differences between its behavior and that of conventional metrics are illuminated. As a result one of the basic tenets of nearest-neighbor-based classification is challenged.
2017 IEEE International Conference on Computer Vision (ICCV)
Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects about actions and scenes. In this work, we describe our ongoing collection of the "something-something" database of video prediction tasks whose solutions require a common sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption-templates. We also describe the challenges in crowd-sourcing this data at scale.
We propose a self-organizing archival Intermemory. That is, a noncommercial subscriber-provided distributed information storage service built on the existing Internet. Given an assumption of continued growth in the memory's total size, a subscriber's participation for only a finite time can nevertheless ensure archival preservation of the subscriber's data. Information disperses through the network over time and memories become more difficult to erase as they age. The probability of losing an old memory given random node failures is vanishingly small, and an adversary would have to corrupt hundreds of thousands of nodes to destroy a very old memory. This paper presents a framework for the design of an Intermemory, and considers certain aspects of the design in greater detail. In particular, the aspects of addressing, space efficiency, and redundant coding are discussed.
Papers by Peter Yianilos