Papers by CARLOS ARMANDO VARGAS ROJAS
Proceedings of the 2006 SIAM International Conference on Data Mining, 2006
We present a new approach for tracking evolving and noisy data streams by estimating clusters based on density, while taking into account the possibility of the presence of an unknown number of outliers, the emergence of new patterns, and the forgetting of old patterns.
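The decay-and-absorb behavior described above can be sketched in a few lines: micro-clusters gain weight when they absorb nearby points, lose weight exponentially over time (forgetting old patterns), and are pruned when too light, while points far from every cluster start new ones (emerging patterns or outliers). All names and parameters here (`DecayingMicroCluster`, `lam`, `radius`, `min_weight`) are illustrative assumptions, not the paper's actual algorithm.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class DecayingMicroCluster:
    """A density micro-cluster whose weight decays over time (hypothetical sketch)."""
    def __init__(self, center, t):
        self.center = list(center)
        self.weight = 1.0
        self.last_update = t

    def decay(self, t, lam=0.01):
        # Exponential forgetting: old patterns fade unless reinforced.
        self.weight *= math.exp(-lam * (t - self.last_update))
        self.last_update = t

    def absorb(self, point):
        # Pull the center toward the new point, weighted by accumulated mass.
        self.weight += 1.0
        w = 1.0 / self.weight
        self.center = [(1 - w) * c + w * p for c, p in zip(self.center, point)]

def update(clusters, point, t, radius=1.0, min_weight=0.1):
    """Assign a point to the nearest cluster within `radius`, else start a new one;
    prune clusters whose decayed weight marks them as outliers or forgotten patterns."""
    for c in clusters:
        c.decay(t)
    clusters[:] = [c for c in clusters if c.weight >= min_weight]
    if clusters:
        nearest = min(clusters, key=lambda c: dist(c.center, point))
        if dist(nearest.center, point) <= radius:
            nearest.absorb(point)
            return clusters
    clusters.append(DecayingMicroCluster(point, t))
    return clusters
```

In this sketch a single pass over the stream suffices: each arriving point triggers one decay-prune-assign step, which is what makes the approach usable when the data cannot be stored or revisited.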
Proceedings of the 2007 SIAM International Conference on Data Mining, 2007
In this paper, we study the behavior of collaborative filtering based recommendations under evolving user profile scenarios. We propose a systematic validation methodology that allows for simulating various controlled user profile evolution scenarios and validating the studied recommendation strategies. Through the presented work, we observe the effect of the curse of dimensionality and sparsity that can wreak havoc on collaborative filtering in a streaming scenario, and conclude that a hybrid approach with both content and collaborative filtering may be the way to go in a high sparsity streaming scenario.
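A minimal sketch of the two ingredients the abstract contrasts: a user-user collaborative filtering prediction, and a hybrid score that blends it with a content-based score when sparsity makes the collaborative term unreliable. The functions, the rating encoding (0 = unrated), and the blending weight `alpha` are hypothetical illustrations, not the paper's methodology.

```python
import math

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def cf_predict(target, others, item):
    """Predict the target user's rating for `item` as a similarity-weighted
    average over the users who rated it (0 means unrated)."""
    num = den = 0.0
    for u in others:
        if u[item]:
            s = cosine(target, u)
            num += s * u[item]
            den += abs(s)
    return num / den if den else 0.0

def hybrid_predict(target, others, item, content_score, alpha=0.5):
    # Under high sparsity the collaborative term degrades, so blend in a
    # content-based score (hypothetical fixed weighting; a real system
    # might adapt alpha to the observed sparsity).
    return alpha * cf_predict(target, others, item) + (1 - alpha) * content_score
```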
2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009
We present a generic framework to evaluate patterns obtained from transactional web data streams whose underlying distribution changes with time. The evolving nature of the data makes it very difficult to determine whether there is structure in the data stream, and whether this structure is being learned. This challenge arises in applications such as mining online store transactions, summarizing dynamic …
The 12th IEEE International Conference on Fuzzy Systems, 2003. FUZZ '03.
Many real world problems are dynamic in nature, dealing with changing environments or objective functions. Dynamic objective functions can make the evolutionary search tedious or unsuccessful for Genetic Algorithms. Some work has focused on altering the evolutionary process, including the selection strategy, genetic operators, replacement strategy, or fitness modification. Other work has focused on the concept of genotype-to-phenotype mapping or gene expression. This line of work includes models based on diploidy and dominance, messy GAs, the proportional GA, overlapping genes as in the DNA coding method, the floating point representation, and the structured GA. In particular, the structured GA (sGA) uses a simple hierarchical chromosome representation, where lower level genes are collectively switched on or off by specific higher level genes. Genes that are switched on are expressed into the final phenotype, while genes that are switched off do not contribute to coding the phenotype. We have recently proposed a modification of the sGA based on the concept of a soft activation mechanism. The lower level genes are no longer limited to total expression or none at all. Instead, they can be expressed to different gradual degrees. The soft structured Genetic Algorithm (s²GA) inherits all the advantages of its crisp (non-fuzzy) counterpart (sGA), and possesses several additional unique features compared to the sGA and other GA based techniques. In this paper, we empirically demonstrate several strengths of the s²GA approach with regard to non-stationary objective function optimization.
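The contrast between crisp and soft activation can be illustrated with a toy genotype-to-phenotype mapping: each high-level control value scales the expression of its group of lower-level genes. With values restricted to {0, 1} this reduces to the sGA's on/off switching; with values in [0, 1] the genes are expressed to gradual degrees. This is a simplified assumption for illustration, not the exact s²GA encoding.

```python
def soft_expression(control, groups):
    """Map a structured chromosome to a phenotype: each high-level control
    value in [0, 1] scales the expression of its group of lower-level genes
    (soft activation, versus the crisp on/off of the structured GA)."""
    phenotype = []
    for degree, genes in zip(control, groups):
        # degree == 0.0 silences the group; degree == 1.0 expresses it fully;
        # intermediate values express it partially.
        phenotype.extend(degree * g for g in genes)
    return phenotype
```

For example, `control=[1.0, 0.0]` reproduces crisp sGA behavior (the second group contributes nothing), while `control=[0.5, 1.0]` expresses the first group at half strength, a state the crisp sGA cannot represent.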
2011 IEEE First International Conference on Healthcare Informatics, Imaging and Systems Biology, 2011
The healthcare industry as a whole lags far behind other industries in terms of knowledge discovery capabilities. There are many piece-wise approaches to analysis of patient records. Unfortunately, there are few approaches that enable a completely automated approach that supports not just search, but also discovery and prediction of patient health. The work presented here describes a computational framework that provides near complete automation of the discovery and trending of patient characteristics. This approach has been successfully applied to the domain of mammography, but could be applied to other domains of radiology with minimal effort.
Many system-level studies involve the simulation of power systems with growing penetration of power electronic devices and loads at distribution level. This gives rise to the need for fast models of static switching converters and efficient simulation tools. The so-called dynamic average-value models (AVMs), which use averaging techniques by neglecting switching, are widely used and appear to be very promising for the system-level transient studies of power-electronic loads. However, considering converter losses and predicting the energy efficiency remains challenging. This paper presents a methodology to include converter losses in AVMs of medium power active front-end (AFE) rectifier loads. The proposed new method is applied to a two-level voltage source converter (VSC) operating under direct power control (DPC). The conducted simulation results demonstrate the new methodology and confirm its accuracy in predicting conduction and switching losses and efficiency, while providing significant computational savings compared to the detailed switching model.
International Conference on Information and Knowledge Management, Proceedings, 2006
The increasing expansion of websites and their web usage necessitates increasingly scalable techniques for Web usage mining that can be better cast within the framework of mining evolving data streams [1, 5]. Despite recent developments in mining evolving Web clickstreams [3, 6], there has not been any investigation of the performance of collaborative filtering [2] in the demanding environment of evolving data streams. In this paper, we study limited memory collaborative filtering based recommendations in evolving scenarios using ...
Lecture Notes in Computer Science, 2004
Data mining has recently attracted attention as a set of efficient techniques that can discover patterns from huge data. More recent advancements in collecting massive evolving data streams created a crucial need for dynamic data mining. In this paper, we present a genetic algorithm based on a new representation mechanism that allows several phenotypes to be simultaneously expressed to different degrees in the same chromosome. This gradual multiple expression mechanism can offer a simple model for a multiploid representation with self-adaptive dominance, including co-dominance and incomplete dominance. Based on this model, we also propose a data mining approach that considers the data as a reflection of a dynamic environment, and investigate a new evolutionary approach based on continuously mining non-stationary data sources that do not fit in main memory. Preliminary experiments are performed on real Web clickstream data.
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005
While scalable data mining methods are expected to cope with massive Web data, coping with evolving trends in noisy data in a continuous fashion, without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic and single pass setting can be cast within the framework of mining evolving data streams. In this paper, we explore the task of mining mass user profiles by discovering evolving Web session clusters in a single pass with a recently proposed scalable immune based clustering approach (TECNO-STREAMS), and study the effect of the choice of different similarity measures on the mining process and on the interpretation of the mined patterns. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages, and furthermore requiring that the affinity of the data to the learned profiles or summaries be defined by the minimum of their coverage or precision, hence requiring that the learned profiles are simultaneously precise and complete, with no compromises. In our experiments, we study the task of mining evolving user profiles from Web clickstream data (Web usage mining) in a single pass, and under different trend sequencing scenarios, showing that compared to the cosine similarity measure, the proposed similarity measure explicitly based on precision and coverage allows the discovery of more correct profiles at the same precision or recall quality levels.
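The "minimum of coverage or precision" idea can be sketched as a set-based similarity between a Web session and a learned profile: precision measures how much of the session falls inside the profile, coverage how much of the profile the session accounts for, and taking the minimum forces both to be high at once. The exact definitions in the paper may differ (e.g., weighted rather than set-based), so treat this as an illustrative assumption.

```python
def min_pc_similarity(session, profile):
    """Min-of-precision-and-coverage similarity (sketch).

    precision = |session ∩ profile| / |session|  -- is the session precise?
    coverage  = |session ∩ profile| / |profile|  -- is the profile covered?
    The minimum is high only when the profile is simultaneously precise
    and complete with respect to the session.
    """
    s, p = set(session), set(profile)
    if not s or not p:
        return 0.0
    overlap = len(s & p)
    precision = overlap / len(s)
    coverage = overlap / len(p)
    return min(precision, coverage)
```

Unlike cosine similarity, this score cannot be inflated by matching only a small, dense part of a large profile: a session of three pages that hits two pages of a five-page profile scores min(2/3, 2/5) = 0.4, penalized by the three uncovered pages.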
Lecture Notes in Computer Science, 2003
Artificial Immune System (AIS) models offer a promising approach to data analysis and pattern recognition. However, in order to achieve a desired learning capability (for example detecting all clusters in a data set), current models require the storage and manipulation of a large network of B Cells (with a number often exceeding the number of data points, in addition to all the pairwise links between these B Cells). Hence, current AIS models are far from being scalable, which makes them of limited use, even for medium size data sets. We propose a new scalable AIS learning approach that exhibits superior learning abilities, while at the same time requiring modest memory and computational costs. Like the natural immune system, the strongest advantage of immune based learning compared to current approaches is expected to be its ease of adaptation in dynamic environments. We illustrate the ability of the proposed approach in detecting clusters in noisy data.
Journal of Medical Systems, 2011
As massive collections of digital health data are becoming available, the opportunities for large-scale automated analysis increase. In particular, the widespread collection of detailed health information is expected to help realize a vision of evidence-based public health and patient-centric health care. Within such a framework for large scale health analytics we describe the transformation of a large data set of mostly unlabeled and free-text mammography data into a searchable and accessible collection, usable for analytics. We also describe several methods to characterize and analyze the data, including their temporal aspects, using information retrieval, supervised learning, and classical statistical techniques. We present experimental results that demonstrate the validity and usefulness of the approach, since the results are consistent with the known features of the data, provide novel insights about it, and can be used in specific applications. Additionally, based on the process of going from raw data to results from analysis, we present the architecture of a generic system for health analytics from clinical notes.
IEEE Transactions on Power Electronics, 2011
One of the best known control methods for three-phase active rectifiers is the so-called direct power control (DPC). The control algorithm of DPC is primarily based on the regulation of instantaneous active and reactive power. Because the DPC method can operate without any grid voltage sensors, instantaneous power and grid phase voltages must be estimated. The …
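The instantaneous power quantities that DPC regulates follow the standard p-q formulation in the stationary αβ frame: p = vα·iα + vβ·iβ and q = vβ·iα − vα·iβ. This is textbook material rather than the paper's estimator; in sensorless DPC the measured grid voltages below would be replaced by estimated ones.

```python
def clarke(a, b, c):
    """Amplitude-invariant Clarke transform of three-phase quantities
    into the stationary alpha-beta frame."""
    alpha = (2.0 * a - b - c) / 3.0
    beta = (b - c) / 3.0 ** 0.5
    return alpha, beta

def instantaneous_power(v_ab, i_ab):
    """Instantaneous active power p and reactive power q in the
    alpha-beta frame, the quantities regulated by direct power control."""
    va, vb = v_ab
    ia, ib = i_ab
    p = va * ia + vb * ib
    q = vb * ia - va * ib
    return p, q
```

With this sign convention, a current in phase with the voltage yields pure active power (q = 0), while a current lagging the voltage by 90° yields pure reactive power (p = 0).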
IEEE Transactions on Power Delivery, 2007
Power transformers are among the most costly pieces of equipment used in electrical systems. A major research effort has therefore focused on detecting failures of their insulating systems prior to unexpected machine outage. Although several industrial methods exist for the online and offline monitoring of power transformers, all of them are expensive and complex, and require the use of specific electronic instrumentation. For these reasons, this paper presents online analysis of transformer leakage flux as an efficient alternative procedure for assessing machine integrity and detecting the presence of insulation failures during their earliest stages. A 12-kVA 400-V/400-V power transformer was specifically manufactured for the study. A finite-element model of the machine was designed to obtain the transient distribution of leakage flux lines in the machine's transversal section under normal operating conditions and when shorted turns are intentionally produced. Inexpensive and simple sensors, based on air-core coils, were built to measure the leakage flux of the transformer, and nondestructive tests were also applied to the machine in order to analyze pre- and post-failure voltages induced in the coils. Results point to the ability to detect very early stages of failure, as well as to locate the position of the shorted turn in the transformer windings.
Computer Networks, 2006
The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic and single pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models that may furthermore bring some interesting surprises when it comes to the behavior of some well known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns in the harsh single pass requirement scenario. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity, and close relatives such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the performance of the MinPC similarity is generally better than the cosine similarity, and that this outperformance can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unravels. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios.
In this paper, an image search tool that combines keyword and image content feature querying and search is presented. The developed search tool tries to bridge the gap between commercial search engines, which are based on keyword search, and CBIR (Content Based Image Retrieval) systems, developed mostly in the academic field and designed to search based on image content. The tool is implemented by building on and extending the open source text-based search engine Nutch and its powerful Lucene based crawling and indexing capabilities. Several user friendly search options are provided to allow users to query the index not only with words, but also by showing an image example, as well as by image feature descriptions. Even though we evaluate the developed tool by running a set of controlled experiments on the COREL 5000 image database, the developed search tool is able to crawl images from the World Wide Web at a larger scale.
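One way to combine the two query modes described above is to re-rank keyword results by blending the text relevance score with a content-based score derived from image feature distance. The fusion formula and the weight `w` below are hypothetical illustrations, not the tool's actual ranking.

```python
def combined_score(text_score, feat_dist, w=0.5):
    """Blend a keyword relevance score (e.g., from a Lucene-style index)
    with a content-based score derived from the distance between the
    query image's features and the candidate image's features.
    Hypothetical fusion: distance 0 maps to content score 1, and larger
    distances decay toward 0."""
    content_score = 1.0 / (1.0 + feat_dist)
    return w * text_score + (1 - w) * content_score
```

A query that supplies only keywords can set `w = 1.0`, and a pure query-by-example can set `w = 0.0`, so one scoring function covers all three search options.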
Proc. of WebKDD, Aug 1, 2003
Web usage mining has recently attracted attention as a viable framework for extracting useful access pattern information, such as user profiles, from massive amounts of Web log data for the purpose of Web site personalization and organization. These efforts have relied mainly on clustering or association rule discovery as the enabling data mining technologies. Typically, data mining has to be completely re-applied periodically and offline on newly generated Web server logs in order to keep the discovered knowledge up to ...
Revista Técnica de la …, 2010
Carlos Rojas, Nancy Rincón, Altamira Díaz, Gilberto Colina, Elisabeth Behling, Elsa Chacín and Nola Fernández. … The flotation system was used for the treatment of production oil waters, from an API separator located at the Patio Tanks ULE, in Tía Juana, Zulia …