Papers by Mohammad Al Hasan
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019
IEEE Transactions on Knowledge and Data Engineering, 2015
Frequent subgraph mining (FSM) is an important task for exploratory data analysis on graph data. Over the years, many algorithms have been proposed to solve this task. These algorithms assume that the data structure of the mining task is small enough to fit in the main memory of a computer. However, as real-world graph data grows, both in size and quantity, such an assumption no longer holds. To overcome this, some graph database-centric methods have been proposed in recent years for solving FSM; however, a distributed solution using the MapReduce paradigm has not been explored extensively. Since MapReduce is becoming the de facto paradigm for computation on massive data, an efficient FSM algorithm on this paradigm is in high demand. In this work, we propose a frequent subgraph mining algorithm called MIRAGE which uses an iterative MapReduce-based framework. MIRAGE is complete as it returns all the frequent subgraphs for a given user-defined support, and it is efficient as it applies all the optimizations that the latest FSM algorithms adopt. Our experiments with real-life and large synthetic datasets validate the effectiveness of MIRAGE for mining frequent subgraphs from large graph datasets. The source code of MIRAGE is available from www.cs.iupui.edu/~alhasan/software/
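The MapReduce iteration sketched in the abstract — map tasks emitting candidate subgraphs found in each input graph, reduce tasks summing support counts and filtering by the user-defined threshold — can be illustrated with a small local simulation. This is an illustrative sketch, not MIRAGE's actual implementation; the single labeled edges stand in for a hypothetical first-iteration candidate set.

```python
from collections import Counter

def map_phase(graph_edges):
    # map: emit each distinct candidate pattern found in one graph exactly once
    # (candidates here are single labeled edges, as in an FSM first iteration)
    return [(edge, 1) for edge in set(graph_edges)]

def reduce_phase(emissions, minsup):
    # reduce: sum emissions per pattern and keep those meeting minimum support
    counts = Counter()
    for pattern, one in emissions:
        counts[pattern] += one
    return {p: c for p, c in counts.items() if c >= minsup}

# three toy input graphs, each given as a list of labeled edges
graphs = [[("A", "B"), ("B", "C")], [("A", "B")], [("B", "C"), ("C", "D")]]
emissions = [kv for g in graphs for kv in map_phase(g)]
print(sorted(reduce_phase(emissions, minsup=2).items()))  # patterns with support >= 2
```

In an actual iterative framework, each round's frequent patterns would seed the next round's candidate generation by extension.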
In systems biology, biological networks are used to gain insights into biological systems. While the traditional approach to studying biological networks is based on the identification of interactions among genes or the identification of a gene set ranking according to differentially expressed gene lists, little is known about interactions between higher-order biological systems, a network of gene sets. Several types of gene set networks have been proposed, including co-membership, linkage, and co-enrichment human gene set networks. However, to our knowledge, none of them contains directionality information. Therefore, in this study we propose a method to construct a regulatory gene set network, a directed network, which reveals novel relationships among gene sets. A regulatory gene set network was constructed by using publicly available gene regulation data. A directed edge in a regulatory gene set network represents a regulatory relationship from one gene set to the other gene set…
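The construction described — a directed edge from gene set A to gene set B when a regulator gene in A regulates a target gene in B — can be sketched as below. The regulation pairs and gene sets are illustrative toy data, not the paper's actual sources.

```python
# Build a directed gene-set network from gene-level regulation pairs
# (regulator -> target). Add edge A -> B if some gene in set A regulates
# some gene in set B. All names below are toy illustrative data.
regulations = [("TP53", "MDM2"), ("MYC", "CDK4")]
gene_sets = {
    "apoptosis": {"TP53", "BAX"},
    "cell_cycle": {"MDM2", "CDK4", "MYC"},
}

edges = set()
for reg, tgt in regulations:
    for a, genes_a in gene_sets.items():
        for b, genes_b in gene_sets.items():
            if a != b and reg in genes_a and tgt in genes_b:
                edges.add((a, b))

print(sorted(edges))  # directed gene-set relationships
```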
Increasing rates of social harm events and a plethora of text data demand the use of text mining techniques not only to better understand their causes but also to develop optimal prevention strategies. In this work, we study three social harm issues: crime topic models, transitions into drug addiction, and homicide investigation chronologies. Topic modeling for the categorization and analysis of crime report text allows for more nuanced categories of crime compared to official UCR categorizations. This study has important implications for hotspot policing. We investigate the extent to which topic models that improve coherence lead to higher levels of crime concentration. We further explore the transitions into drug addiction using Reddit data. We propose a prediction model to classify users' transitions from a casual drug discussion forum to a recovery drug discussion forum and the likelihood of such transitions. Through this study we offer insights into modern drug culture…
Kidney360, 2021
The immune system governs key functions that maintain renal homeostasis through various effector cells that reside in or infiltrate the kidney. These immune cells play an important role in shaping adaptive or maladaptive responses to local or systemic stress and injury. We increasingly recognize that microenvironments within the kidney are characterized by a unique distribution of immune cells, the function of which depends on this unique spatial localization. Therefore, quantitative profiling of immune cells in intact kidney tissue becomes essential, particularly at a scale and resolution that allow the detection of differences between the various “nephro-ecosystems” in health and disease. In this review, we discuss advancements in tissue cytometry of the kidney, performed through multiplexed confocal imaging and analysis using the Volumetric Tissue Exploration and Analysis (VTEA) software. We highlight how this tool has improved our understanding of the role of the immune system in…
Autonomous navigation of agricultural robots is an essential task in precision agriculture, and the success of this task critically depends on accurate detection of crop rows using computer vision methodologies. This is a challenging task due to substantial natural variations in crop row images arising from various factors, including missing crops in parts of a row, high and irregular weed growth between rows, different crop growth stages, different inter-crop spacing, and variation in weather conditions and lighting. The processing time of the detection algorithm also needs to be small so that the desired number of image frames from continuous video can be processed in real time. To cope with all the above-mentioned requirements, we propose a crop row detection algorithm consisting of the following three linked stages: (1) color-based segmentation for differentiating crop and weed from background, (2) differentiating crop and weed pixels using a clustering algorithm, and (3) robust line fitting to detect crop rows. We test the proposed algorithm over a wide variety of scenarios and compare its performance against four different types of existing strategies for crop row detection. Experimental results show that the proposed algorithm performs better than the competing algorithms with reasonable accuracy. We also perform additional experiments to test the robustness of the proposed algorithm over different values of the tuning parameters and over different clustering methods, such as KMeans, MeanShift, Agglomerative, and HDBSCAN.
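Stages (1) and (3) of the pipeline can be sketched with common building blocks: an excess-green index (2G − R − B) is a standard choice for color-based vegetation segmentation, and a least-squares line fit through crop-pixel coordinates recovers a row line. This is an assumed minimal stand-in for the paper's method, with the clustering stage omitted.

```python
import numpy as np

def excess_green_mask(img, thresh=20):
    # img: HxWx3 uint8 RGB array; ExG = 2G - R - B highlights vegetation
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    return (2 * g - r - b) > thresh

def fit_row_line(mask):
    # Fit x = a*y + b through vegetation pixels (rows assumed roughly vertical)
    ys, xs = np.nonzero(mask)
    a, b = np.polyfit(ys, xs, 1)
    return a, b

# toy image: a vertical green stripe at x = 10..12 on a dark background
img = np.zeros((50, 30, 3), dtype=np.uint8)
img[:, 10:13, 1] = 200  # green channel only
mask = excess_green_mask(img)
a, b = fit_row_line(mask)  # slope ~0, intercept ~11
```

A real pipeline would run the clustering stage between these two steps to separate crop pixels from weed pixels before line fitting.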
2020 IEEE International Conference on Big Data (Big Data), 2020
In corn breeding, hand-measurement of ear height is a labor-intensive process, thus limiting scalability. Here we show that it is feasible to automate estimation of the average ear height of a row of corn in experimental fields used for corn breeding. For this purpose we use point pattern analysis on predicted shank-node locations extracted from video captured on uncalibrated cameras moving through a plot at a fixed height from the ground (4 feet and 2 feet). First, a convolutional neural network-based object detection system (YOLOv3) was trained to detect the ear-stalk connection point and applied to the collected videos. Detected ear position and time information from each frame were superimposed into a point pattern and point features were then extracted. Using ridge regression to predict the average ear height per plot, we achieved 0.772 concordance, 2.989 inches root mean squared error, and 2.263 inches mean absolute error compared with hand-measured average ear height. Feature…
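The final regression step — predicting per-plot average ear height from extracted point-pattern features — can be sketched with closed-form ridge regression. The feature matrix below is hypothetical toy data, not the paper's actual features.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# hypothetical point-pattern features per plot: mean detection height, count
X = np.array([[40.0, 12], [44.0, 15], [48.0, 11], [52.0, 14]])
y = np.array([41.0, 45.0, 49.0, 53.0])  # hand-measured ear height (inches)

w = ridge_fit(X, y, lam=0.1)
pred = X @ w  # close to the hand measurements on this toy data
```

The regularization term `lam` trades a small amount of bias for stability when features are correlated, which point-pattern features extracted from overlapping frames typically are.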
System execution traces (execution logs) are traditionally used to evaluate functional properties of a software system. Prior research, however, has shown the usefulness of system execution traces in evaluating software system performance properties. Due to the complexity and verbosity of a system execution trace, however, higher-level abstractions, e.g., dataflow models, are required to support such evaluation. Our current research effort has therefore focused on extending this dataflow model-based system performance analysis in two directions. In one aspect, we have considered adapting the dataflow model when the system execution trace does not contain properties required to support performance analysis. In the other aspect, we have developed techniques to auto-generate the supporting dataflow model from a system execution trace. The second aspect is critical because it is hard to manually craft a dataflow model for large and complex software systems, especially distributed software systems…
2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), 2021
Detecting the source of an outbreak cluster during a pandemic like COVID-19 can provide insights into the transmission process and associated risk factors, and help contain the spread. In this work we study the problem of source detection from multiple snapshots of spreading on an arbitrary network structure. We use a spatio-temporal graph convolutional network-based model (SD-STGCN) to produce a source probability distribution by fusing information from temporal and topological spaces. We perform extensive experiments using popular compartmental simulation models over synthetic networks and empirical contact networks. We also demonstrate the applicability of our approach with real COVID-19 case data.
Neuroinformatics, Jan 25, 2018
Neuroimaging genomics is an emerging field that provides exciting opportunities to understand the genetic basis of brain structure and function. The unprecedented scale and complexity of the imaging and genomics data, however, have presented critical computational bottlenecks. In this work we present our initial efforts towards building an interactive visual exploratory system for mining big data in neuroimaging genomics. A GPU-accelerated browsing tool for neuroimaging genomics is created that implements the ANOVA algorithm for single nucleotide polymorphism (SNP)-based analysis and the VEGAS algorithm for gene-based analysis, and executes them at interactive rates. The ANOVA algorithm is 110 times faster than the 4-core OpenMP version, while the VEGAS algorithm is 375 times faster than its 4-core OpenMP counterpart. This approach lays a solid foundation for researchers to address the challenges of mining large-scale imaging genomics datasets via interactive visual exploration.
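The SNP-based ANOVA computation mentioned above amounts to a one-way F-test of a phenotype grouped by genotype (0/1/2 copies of the minor allele). A minimal CPU sketch of that statistic follows; the paper's contribution is the GPU-accelerated, interactive version, not this formula.

```python
import numpy as np

def anova_f(phenotype, genotype):
    # One-way ANOVA F-statistic: phenotype values grouped by SNP genotype
    groups = [phenotype[genotype == g] for g in np.unique(genotype)]
    grand = phenotype.mean()
    k, n = len(groups), len(phenotype)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# toy data: a strong additive genotype effect on the imaging phenotype
pheno = np.array([1.0, 1.2, 0.9, 2.0, 2.1, 1.9, 3.0, 3.2, 2.8])
geno = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
f = anova_f(pheno, geno)  # large F indicates a genotype-phenotype association
```

Running this independently for each of millions of SNP-phenotype pairs is what makes GPU parallelization attractive.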
SAE Technical Paper Series, 2017
Accuracy in detecting a moving object is critical to autonomous driving and advanced driver assistance systems (ADAS). By including the object classification from multiple sensor detections, the model of the object or environment can be identified more accurately. The critical parameters involved in improving the accuracy are the size and the speed of the moving object. All sensor data are to be used in defining a composite object representation so that it can be used for the class information in the core object's description. This composite data can then be used by a deep learning network for complete perception fusion in order to solve the problem of detecting and tracking moving objects. Camera image data from subsequent frames along the time axis, in conjunction with the speed and size of the object, will further contribute to developing better recognition algorithms. In this paper, we present preliminary results using only camera images for detecting various objects using a deep learning network, as a first step toward multi-sensor fusion algorithm development. The simulation experiments based on camera images show encouraging results, where the proposed deep learning network-based detection algorithm was able to detect various objects with a certain degree of confidence. A laboratory experimental setup is being commissioned in which three different types of sensors, a digital camera with 8-megapixel resolution, a LIDAR with 40 m range, and ultrasonic distance transducer sensors, will be used for multi-sensor fusion to identify objects in real time.
16th IEEE International Symposium on Object/component/service-oriented Real-time distributed Computing (ISORC 2013), 2013
This paper presents a method and tool named the Dataflow Model Auto-Constructor (DMAC). DMAC uses frequent-sequence mining and Dempster-Shafer theory to mine a system execution trace and reconstruct its corresponding dataflow model. Distributed system testers then use the resultant dataflow model to analyze performance properties (e.g., end-to-end response time, throughput, and service time) captured in the system execution trace. Results from applying DMAC to different case studies show that DMAC can reconstruct dataflow models that cover up to 94% of the events in the original system execution trace. Likewise, more than two sources of evidence are needed to reconstruct dataflow models for systems with multiple execution contexts.
Scientific Programming, 2012
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management - CIKM '13, 2013
In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log, which we call buyers' data, and frequency data from product titles, which we call sellers' data. We exploit the taxonomical structure of the marketplace to build domain-specific frequency models. Using such an approach, QSEGMENT performs better than previously described baselines for query segmentation. Also, we perform a large-scale evaluation using an unsupervised IR metric which we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.
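A common formulation of frequency-driven query segmentation, which models like QSEGMENT's buyers' and sellers' data presumably feed into, is a dynamic program that maximizes the product of segment scores. The probability table below is hypothetical; the paper's actual models are taxonomy-specific.

```python
from functools import lru_cache

# Hypothetical segment probabilities (a real system would mix buyers' and
# sellers' frequency models); unseen segments get a small floor probability.
PROB = {"ipod": 0.05, "nano": 0.03, "ipod nano": 0.02,
        "case": 0.04, "nano case": 0.001}

def best_segmentation(tokens):
    # Dynamic program over split points: maximize the product of segment scores
    n = len(tokens)

    @lru_cache(maxsize=None)
    def solve(i):
        if i == n:
            return 1.0, ()
        best_score, best_segs = 0.0, ()
        for j in range(i + 1, n + 1):
            seg = " ".join(tokens[i:j])
            sub_score, sub_segs = solve(j)
            score = PROB.get(seg, 1e-6) * sub_score
            if score > best_score:
                best_score, best_segs = score, (seg,) + sub_segs
        return best_score, best_segs

    return list(solve(0)[1])

print(best_segmentation(["ipod", "nano", "case"]))  # → ['ipod nano', 'case']
```

Because segment scores are probabilities below 1, over-splitting is penalized multiplicatively, so a frequent multiword segment like "ipod nano" beats its token-by-token split.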
Proteins: Structure, Function, and Bioinformatics, 2007
We describe an efficient method for partial complementary shape matching for use in rigid protein–protein docking. The local shape features of a protein are represented using boolean data structures called Context Shapes. The relative orientations of the receptor and ligand surfaces are searched using precalculated lookup tables. Energetic quantities are derived from shape complementarity and buried surface area computations, using efficient boolean operations. Preliminary results indicate that our context shapes approach outperforms state-of-the-art geometric shape-based rigid-docking algorithms.
Journal of Biomedicine and Biotechnology, 2009
Multispectral three-dimensional (3D) imaging provides spatial information for biological structures that cannot be measured by traditional methods. This work presents a method of tracking 3D biological structures to quantify changes over time using graph theory. Cell-graphs were generated based on the pairwise distances, in 3D Euclidean space, between nuclei during collagen I gel compaction. From these graphs quantitative features are extracted that measure both the global topography and the frequently occurring local structures of the “tissue constructs.” The feature trends can be controlled by manipulating compaction through cell density and are significant when compared to random graphs. This work presents a novel methodology to track a simple 3D biological event and quantitatively analyze the underlying structural change. Further application of this method will allow for the study of complex biological problems that require the quantification of temporal-spatial information in 3D…
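A cell-graph of the kind described can be built by thresholding pairwise 3D Euclidean distances between nuclei; the coordinates and distance threshold below are illustrative assumptions, not the paper's data.

```python
import numpy as np
from itertools import combinations

def cell_graph(coords, r):
    # coords: (n, 3) array of nucleus centroids in 3D Euclidean space.
    # Connect two nuclei with an edge when their distance is at most r.
    n = len(coords)
    adj = np.zeros((n, n), dtype=bool)
    for i, j in combinations(range(n), 2):
        if np.linalg.norm(coords[i] - coords[j]) <= r:
            adj[i, j] = adj[j, i] = True
    return adj

# toy centroids: three nearby nuclei plus one isolated nucleus
coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]], dtype=float)
adj = cell_graph(coords, r=1.5)
print(int(adj.sum()) // 2, adj.sum(axis=1).tolist())  # → 3 [2, 2, 2, 0]
```

Graph features such as degree distributions, computed per time point, are what allow structural change to be tracked quantitatively during compaction.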
Various recent studies show that network biomarkers that consider biological networks or pathways, instead of individual proteins, yield more reliable results and higher accuracy in cancer screening, diagnosis, and treatment customization. However, the existing approaches for finding network biomarkers are ad hoc, and they do not offer an option to search the large network space to obtain a small set of significant network biomarkers. In this work, we present a graph mining-based approach for a systematic discovery of network biomarkers…
2019 IEEE International Conference on Big Data (Big Data)
Due to its universal applications in the domains of social network analysis, e-commerce, and recommendation systems, the task of link prediction has received enormous attention from the data mining and machine learning communities over the last decade. In its original setting, the task only predicts whether a pair of entities who are not connected at present will form a connection in the future. However, in real life an entity sometimes joins a group (or a community), thus making a connection with the group instead of connecting with an individual. Existing solutions to link prediction are inadequate for solving this prediction task. To overcome this challenge, in this work we propose a novel problem named group link prediction, which focuses on evaluating the likelihood of a candidate becoming a member of a group at a given time. The problem has potential applications such as friendship or group suggestions on Facebook or other social networks, as well as co-authorship suggestion or group email recommendations. To solve the problem, we propose a Long Short-Term Memory-based model that takes the embedding vectors of the group as input and outputs the conditional probability distribution over the candidates. We also introduce a composite long short-term memory model that integrates keyword information. Experimental results on real-world data sets validate the superiority of our proposed model in comparison to various baseline methods.
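The model's output stage — a conditional probability distribution over candidates given a group representation — can be sketched as follows. Mean-pooling stands in here for the paper's LSTM encoder, so this is a simplified illustration, not the proposed model.

```python
import numpy as np

def candidate_probs(member_embs, cand_embs):
    # Stand-in for the paper's LSTM: mean-pool member embeddings into a single
    # group vector, then softmax over dot-product scores of the candidates.
    g = member_embs.mean(axis=0)
    scores = cand_embs @ g
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

# toy embeddings: two group members and two candidate entities
members = np.array([[1.0, 0.0], [0.8, 0.2]])
cands = np.array([[1.0, 0.1], [-1.0, 0.0]])
p = candidate_probs(members, cands)  # candidate 0 is most aligned with the group
```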
arXiv (Cornell University), Mar 13, 2022
The query similarity prediction task is generally solved by regression-based models with square loss. Such a model is agnostic of absolute similarity values and penalizes the regression error at all ranges of similarity values at the same scale. However, to boost an e-commerce platform's monetization, it is important to predict high-level similarity more accurately than low-level similarity, as highly similar queries retrieve items according to user intents, whereas moderately similar queries retrieve related items, which may not lead to a purchase. Regression models fail to customize their loss function to concentrate around the high-similarity band, resulting in poor performance on the query similarity prediction task. We address the above challenge by treating query similarity prediction as an ordinal regression problem, and thereby propose a model, ORDSIM (ORDinal Regression for SIMilarity Prediction). ORDSIM exploits variable-width buckets to model ordinal loss, which penalizes errors in high-level similarity harshly and thus enables the regression model to obtain better prediction results for high similarity values. We evaluate ORDSIM on a dataset of over 10 million e-commerce queries from the eBay platform and show that ORDSIM achieves substantially smaller prediction error compared to competing regression methods on this dataset.
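The variable-width bucketing idea can be illustrated by discretizing similarity with buckets that narrow toward 1.0 and encoding each bucket as cumulative binary targets, as is standard in ordinal regression. The bucket edges below are assumed for illustration, not ORDSIM's actual configuration.

```python
import numpy as np

# Variable-width buckets: coarse at low similarity, fine near 1.0 (assumed edges)
EDGES = np.array([0.0, 0.5, 0.8, 0.9, 0.95, 1.0])

def to_ordinal_targets(sim):
    # Map similarity to its bucket index, then emit the cumulative binary
    # encoding "is sim at or above bucket j?" used by ordinal-regression losses.
    k = np.searchsorted(EDGES, sim, side="right") - 1
    k = min(k, len(EDGES) - 2)  # clamp sim == 1.0 into the top bucket
    return [1 if k >= j else 0 for j in range(len(EDGES) - 1)]

print(to_ordinal_targets(0.97))  # → [1, 1, 1, 1, 1]  (high-similarity pair)
print(to_ordinal_targets(0.3))   # → [1, 0, 0, 0, 0]  (low-similarity pair)
```

Because each unit of error in the 0.9–1.0 range crosses more bucket boundaries than the same error at low similarity, mispredictions in the high-similarity band are penalized more harshly.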
2019 IEEE International Conference on Big Data (Big Data), 2019
Since the emergence of the Silk Road market in the early 2010s, dark web 'cryptomarkets' have proliferated, offering people an online platform to buy and sell illicit drugs while relying on cryptocurrencies such as Bitcoin for anonymous transactions. However, recent studies have highlighted the potential for de-anonymization of Bitcoin transactions, calling into question the level of anonymity afforded by cryptomarkets. We examine a set of over 100,000 product reviews from several cryptomarkets collected in 2018 and 2019 and conduct a comprehensive analysis of the markets, including an examination of the distribution of drug sales and revenue among vendors, and a comparison of incidences of opioid sales to overdose deaths in a US city. We explore the potential for de-anonymization of vendors by implementing a Naïve Bayes classifier to predict the vendor from a given product review, and attempt to link vendors' sales to specific Bitcoin transactions. On the buyer side, we evaluate the efficacy of hierarchical agglomerative clustering for grouping together transactions corresponding to the same buyer. We find that the high degree of specialization among the small subset of high-revenue vendors may render these vendors susceptible to de-anonymization. Further research is necessary to confirm these findings, which are restricted by the scarcity of ground-truth data for validation.
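A multinomial Naïve Bayes vendor classifier over review text, of the kind used in the de-anonymization experiment, can be sketched from scratch; the reviews below are fabricated toy examples, not actual cryptomarket data.

```python
from collections import Counter, defaultdict
import math

def train_nb(reviews):
    # reviews: list of (vendor, text); multinomial NB with add-one smoothing
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for vendor, text in reviews:
        toks = text.lower().split()
        word_counts[vendor].update(toks)
        class_counts[vendor] += 1
        vocab.update(toks)
    return word_counts, class_counts, vocab

def predict(model, text):
    # Pick the vendor maximizing log P(vendor) + sum of log P(token | vendor)
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for v in class_counts:
        n = sum(word_counts[v].values())
        lp = math.log(class_counts[v] / total)
        for t in text.lower().split():
            lp += math.log((word_counts[v][t] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = v, lp
    return best

reviews = [("vendorA", "fast shipping great stealth"),
           ("vendorA", "great stealth fast delivery"),
           ("vendorB", "slow shipping poor packaging"),
           ("vendorB", "poor stealth slow delivery")]
model = train_nb(reviews)
print(predict(model, "great stealth fast shipping"))  # → vendorA
```

The intuition matched by the paper's finding: the more distinctive a vendor's review vocabulary (high specialization), the easier such a classifier separates them from other vendors.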