... These include conceptual exercises, which help to clarify some of the more challenging concepts in data mining, and tiny data set exercises, which challenge the reader to apply the ... Daniel T. Larose, Ph.D. Director, Data Mining @CCSU www.ccsu.edu/datamining
Publication date: 12-2002. Language: English. 636 pp., 16x24, hardback. Available from the publisher (supply lead time: 10 days). ... A complete end-to-end source of information on almost all aspects of parallel computing ...
Recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. From the early work it was clear that multilevel techniques held great promise; however, it was not known whether they could be made to consistently produce high-quality partitions for graphs arising in a wide range of application domains. We investigate the effectiveness of many different choices for all three phases: coarsening, partitioning of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a much faster variation of the Kernighan-Lin algorithm for refinement during uncoarsening. We test our scheme on a large number of graphs arising in various domains, including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes, in substantially less time. Also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm.
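The heavy-edge idea in the abstract above can be illustrated with a minimal sketch (an illustrative reconstruction, not the authors' implementation): during one coarsening pass, each unmatched vertex is matched with the unmatched neighbor joined by the heaviest edge, so that heavy edges are collapsed and cannot contribute to the cut.

```python
import random

def heavy_edge_matching(adj):
    """One coarsening pass of a heavy-edge matching heuristic.
    `adj` maps vertex -> {neighbor: edge_weight}; the graph is undirected.
    Returns a dict mapping each vertex to its match (or to itself)."""
    matched = {}
    vertices = list(adj)
    random.shuffle(vertices)          # visit vertices in random order
    for v in vertices:
        if v in matched:
            continue
        # among still-unmatched neighbors, pick the one on the heaviest edge
        candidates = [(w, u) for u, w in adj[v].items() if u not in matched]
        if candidates:
            _, u = max(candidates)
            matched[v], matched[u] = u, v
        else:
            matched[v] = v            # no free neighbor: vertex stays single
    return matched
```

Collapsing each matched pair into one vertex (summing edge weights) yields the next, coarser graph in the multilevel hierarchy.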
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high-dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomerative clustering, and Bayesian classification methods, such as AutoClass.
Traditional graph partitioning algorithms compute a k-way partitioning of a graph such that the number of edges cut by the partitioning is minimized and each partition has an equal number of vertices. Minimizing the edge-cut can be considered the objective, and the requirement that the partitions be of the same size can be considered the constraint. In this paper we extend the partitioning problem by incorporating an arbitrary number of balancing constraints. In our formulation, a vector of weights is assigned to each vertex, and the goal is to produce a k-way partitioning such that the partitioning satisfies a balancing constraint associated with each weight, while attempting to minimize the edge-cut. Applications of this multi-constraint graph partitioning problem include the parallel solution of multi-physics and multi-phase computations, which underlie many existing and emerging large-scale scientific simulations. We present new multi-constraint graph partitioning algorithms that are based on the multilevel graph partitioning paradigm. Our work focuses on developing new types of heuristics for coarsening, initial partitioning, and refinement that are capable of successfully handling multiple constraints. We experimentally evaluate the effectiveness of our multi-constraint partitioners on a variety of synthetically generated problems.
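The balancing constraints described above can be made concrete with a small hypothetical helper (names are my own, not from the paper): with a weight vector per vertex, a k-way partition must keep each partition's load close to the average separately for each weight component.

```python
def imbalance(weights, part, k):
    """Per-constraint load imbalance of a k-way partition.
    weights[v] is the weight vector of vertex v; part[v] is its partition id.
    Returns, for each weight component, max partition load / average load
    (1.0 means that constraint is perfectly balanced)."""
    m = len(weights[0])
    loads = [[0.0] * m for _ in range(k)]
    for w, p in zip(weights, part):
        for j in range(m):
            loads[p][j] += w[j]
    totals = [sum(loads[p][j] for p in range(k)) for j in range(m)]
    return [max(loads[p][j] for p in range(k)) / (totals[j] / k)
            for j in range(m)]
```

A multi-constraint partitioner must keep every component of this vector near 1.0 at once, which is what makes the problem harder than single-constraint balancing.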
We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over the Hierarchical Agglomerative Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms, and we show that our algorithms are fast and scalable. (Authors are listed alphabetically.)
Journal of Parallel and Distributed Computing, 1998
In this paper we present and study a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, find a k-way partitioning of the smaller graph, and then uncoarsen and refine it to construct a k-way partitioning for the original graph. These algorithms compute a k-way partitioning of a graph G = (V, E) in O(|E|) time, which is faster by a factor of O(log k) than previously proposed multilevel recursive bisection algorithms. A key contribution of our work is in finding a high-quality and computationally inexpensive refinement algorithm that can improve upon an initial k-way partitioning. We also study the effectiveness of the overall scheme for a variety of coarsening schemes.
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
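The bisecting K-means variant compared above can be sketched compactly (an illustrative reimplementation, not the paper's code): start with one cluster and repeatedly split the largest cluster in two with a plain 2-means pass until k clusters remain.

```python
import numpy as np

def two_means(X, n_iter=20, seed=0):
    """Plain 2-means (Lloyd's algorithm with k=2) on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to the nearer of the two centers
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in (0, 1):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    """Bisecting K-means: repeatedly bisect the largest cluster until
    k clusters exist. Returns a list of index arrays into X."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                  # largest cluster
        labels = two_means(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters
```

Each bisection runs 2-means on only one cluster's points, which is what keeps the overall cost linear in the number of documents.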
Introduction to Parallel Computing — lecture slides by P. B. Sunil Kumar, Department of Physics, IIT Madras, Chennai 600036 (www.physics.iitm.ac.in/~sunil), 19 November 2010.
In this paper we present a parallel formulation of a multilevel k-way graph partitioning algorithm. The multilevel k-way partitioning algorithm reduces the size of the graph by collapsing vertices and edges (coarsening phase), finds a k-way partition of the smaller graph, and then constructs a k-way partition for the original graph by projecting and refining the partition to successively finer graphs (uncoarsening phase). A key innovative feature of our parallel formulation is that it utilizes graph coloring to effectively parallelize both the coarsening and the refinement during the uncoarsening phase. Our algorithm is able to achieve a high degree of concurrency while maintaining the high-quality partitions produced by the serial algorithm. We test our scheme on a large number of graphs from finite element methods and transportation domains. Our parallel formulation on a Cray T3D produces high-quality 128-way partitions on 128 processors in a little over two seconds for graphs with a million vertices. Thus our parallel algorithm makes it possible to perform dynamic graph partitioning in adaptive computations without compromising quality.
Clustering of data in a high-dimensional space is of great interest in many data mining applications. In this paper, we propose a method for clustering data in a high-dimensional space based on a hypergraph model. In this method, the relationships present in the original high-dimensional data are mapped into a hypergraph. A hyperedge represents a relationship (affinity) among subsets of data, and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on two different data sets: S&P500 stock data for the period 1994-1996 and protein coding data. These experiments demonstrate that our approach is applicable and effective in high-dimensional datasets.
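The objective minimized in the hypergraph model above can be stated in a few lines (an illustrative helper with names of my own choosing): a hyperedge is "cut" when its vertices land in more than one partition, and the cut is the total weight of such hyperedges.

```python
def hyperedge_cut(hyperedges, part):
    """Weighted hyperedge cut of a partition.
    hyperedges: list of (vertex_set, weight) pairs.
    part: mapping vertex -> partition id.
    A hyperedge is cut if its vertices span more than one partition."""
    return sum(w for verts, w in hyperedges
               if len({part[v] for v in verts}) > 1)
```

Minimizing this quantity while keeping partitions balanced is exactly what a hypergraph partitioner is asked to do; related items then tend to end up in the same partition.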
In this paper, we present a new multilevel k-way hypergraph partitioning algorithm that substantially outperforms the existing state-of-the-art K-PM/LFI algorithm for multiway partitioning, both for optimizing local as well as global objectives. Experiments on the ISPD98 benchmark suite show that the partitionings produced by our scheme are on average 15% to 23% better than those produced by the K-PM/LFI algorithm, both in terms of the hyperedge cut and the (K-1) metric. Furthermore, our algorithm is significantly faster, requiring 4 to 5 times less time than that required by K-PM/LFI.
We propose an agent for exploring and categorizing documents on the World Wide Web. The heart of the agent is an automatic categorization of a set of documents, combined with a process for generating new queries used to search for new related documents and filtering the resulting documents to extract the set of documents most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over traditional clustering algorithms and form the basis for the query generation and search component of the agent.
Introduction to Parallel Computing: Design and Analysis of Algorithms. 498 pages. Year of publication: 1994. ISBN 0-8053-3170-0. Authors: Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis.
Journal of Parallel and Distributed Computing, 1994
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of the total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors.
Many techniques for association rule mining and feature selection require a suitable metric to capture the dependencies among variables in a data set. For example, metrics such as support, confidence, lift, correlation, and collective strength are often used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. In this paper, we present an overview of various measures proposed in the statistics, machine learning, and data mining literature. We describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using twenty-one of the existing measures. We show that each measure has different properties that make it useful for some application domains but not for others. We also present two scenarios in which most of the existing measures agree with each other, namely, support-based pruning and table standardization. Finally, we present an algorithm to select a small set of tables such that an expert can select a desirable measure by looking at just this small set of tables.
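Three of the measures surveyed above (support, confidence, and lift) can be computed directly from a 2x2 contingency table; a minimal sketch, with a function name of my own choosing:

```python
def interestingness(n11, n10, n01, n00):
    """Support, confidence, and lift of the rule A -> B, from a 2x2
    contingency table: n11 = count(A and B), n10 = count(A, not B),
    n01 = count(not A, B), n00 = count(not A, not B)."""
    n = n11 + n10 + n01 + n00
    support    = n11 / n                    # P(A and B)
    confidence = n11 / (n11 + n10)          # P(B | A)
    p_a, p_b   = (n11 + n10) / n, (n11 + n01) / n
    lift       = support / (p_a * p_b)      # > 1 indicates positive dependence
    return support, confidence, lift
```

Even on a single table these three measures can rank the same pattern differently, which is the conflict the paper sets out to characterize.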
Papers by Vipin Kumar