Papers by Justin Sybrandt
As the size and scope of online data continues to grow, new machine learning techniques become necessary to best capitalize on the wealth of available information. However, the models that help convert data into knowledge require nontrivial processes to make sense of large collections of text and massive online graphs. In both scenarios, modern machine learning pipelines produce embeddings (semantically rich vectors of latent features) to convert human constructs into a form suited for machine understanding. In this dissertation we focus on information available within biomedical science, including human-written abstracts of scientific papers, as well as machine-generated graphs of biomedical entity relationships. We present the Moliere system, and our method for identifying new discoveries through the use of natural language processing and graph mining algorithms. We propose heuristically-based ranking criteria to augment Moliere, and leverage this ranking to identify a new gene-treatment target for HIV-associated Neurodegenerative Disorders. We additionally focus on the latent features of graphs, and propose a new bipartite graph embedding technique. Using our graph embedding, we advance the state-of-the-art in hypergraph partitioning quality. Having newfound intuition of graph embeddings, we present Agatha, a deep-learning approach to hypothesis generation. This system learns a data-driven ranking criterion derived from the embeddings of our large proposed biomedical semantic graph. To produce human-readable results, we additionally propose CBAG, a conditional language model for biomedical abstract generation.

I would like to thank Ilya Safro, as well as the members of my committee, for guidance over these last four years. I am grateful to have had mentors so amenable and accomplished.
I would also like to acknowledge my collaborators: Michael Shtutman on Chapters 5, 6, and 8, Angelo Carrabba on Chapter 7, Ruslan Shaydulin on Chapter 4, and Ilya Tyagin on Chapter 8. Furthermore, I would like to thank Lisa and Larry Sybrandt, my parents, for raising me to give my best to everything I do. Thank you to my siblings, Jennifer and Joseph Sybrandt, for keeping me grounded. To Marilyn and Darrel Apps, my grandparents, for providing ever-present encouragement and enthusiasm. To my grandmother, Kayleen Sybrandt, for mailing countless comic strips full of three-toed sloths and road-crossing chickens, for long conversations full of Seinfeld references, and for never-wavering love full of optimism. Lastly, I would like to thank all of my friends and colleagues, of whom there are too many to list here. Your contributions, both social and scholarly, were crucial.
arXiv (Cornell University), Apr 12, 2023
We are living in a golden age of machine learning. Powerful models perform many tasks far better than is possible using traditional software engineering approaches alone. However, developing and deploying these models in existing software systems remains challenging. In this paper, we present SmartChoices, a novel approach to incorporating machine learning into mature software stacks easily, safely, and effectively. We highlight key design decisions and present case studies applying SmartChoices within a range of large-scale industrial systems.
1. Notably, this diverges considerably from the design philosophies of ML platforms that enable setting up arbitrary ML pipelines, such as TFX [5] and Kubeflow [15].
2. We will refer to metrics we wish to maximize as rewards, and those we wish to minimize as costs.
arXiv (Cornell University), Feb 13, 2020
Medical research is risky and expensive. Drug discovery, as an example, requires that researchers efficiently winnow thousands of potential targets to a small candidate set for more thorough evaluation. However, research groups spend significant time and money to perform the experiments necessary to determine this candidate set long before seeing intermediate results. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that can introduce data-driven insights earlier in the discovery process. Through a learned ranking criterion, this system quickly prioritizes plausible term-pairs among entity sets, allowing us to recommend new research directions. We massively validate our system with a temporal holdout wherein we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical subdomains, and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. This system achieves best-in-class performance on an established benchmark, and demonstrates high recommendation scores across subdomains. Reproducibility: all code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha.
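The temporal holdout described above can be sketched in a few lines: hold out all term-pair connections first published after a cutoff year, then check how often a trained scorer ranks those true pairs above randomly drawn negatives. This is an illustrative sketch, not AGATHA's actual evaluation harness; the function names and the negative-sampling policy are assumptions.

```python
import random

def temporal_holdout_eval(published_pairs, predict_score, cutoff_year,
                          num_negatives=1, seed=0):
    """Fraction of times a scorer ranks a true post-cutoff pair above a
    random negative pair (an illustrative temporal-holdout metric).

    published_pairs: dict mapping (term_a, term_b) -> year of first
    co-publication.  The scorer must have been trained only on pairs
    published before `cutoff_year`.
    """
    positives = [p for p, yr in published_pairs.items() if yr >= cutoff_year]
    terms = sorted({t for pair in published_pairs for t in pair})
    rng = random.Random(seed)
    hits, total = 0, 0
    for pos in positives:
        for _ in range(num_negatives):
            # Draw a pair that was never published at any year.
            while True:
                neg = (rng.choice(terms), rng.choice(terms))
                if neg[0] != neg[1] and neg not in published_pairs:
                    break
            hits += predict_score(pos) > predict_score(neg)
            total += 1
    return hits / total if total else 0.0
```

Used with a toy scorer that recognizes the held-out pairs, the metric is 1.0; a scorer blind to them scores at chance or below.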
arXiv (Cornell University), May 26, 2019
Typical graph embeddings may not capture type-specific bipartite graph features that arise in such areas as recommender systems, data visualization, and drug discovery. Machine learning methods utilized in these applications would be better served with specialized embedding techniques. We propose two embeddings for bipartite graphs that decompose edges into sets of indirect relationships between node neighborhoods. When sampling higher-order relationships, we reinforce similarities through algebraic distance on graphs. We also introduce ensemble embeddings to combine both into a "best of both worlds" embedding. The proposed methods are evaluated on link prediction and recommendation tasks and compared with other state-of-the-art embeddings. Our embeddings are found to perform better on recommendation tasks and to be equally competitive in link prediction. Although all considered embeddings are beneficial in particular applications, we demonstrate that none of those considered is clearly superior (in contrast to what is claimed in many papers). Therefore, we discuss the trade-offs among them, noting that the methods proposed here are robust for applications relying on same-typed comparisons.
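Algebraic distance on graphs, used above to reinforce higher-order similarities, can be sketched as Jacobi-style relaxation of random node values: after a few smoothing sweeps, strongly connected nodes hold similar values across several random restarts, so a small gap between two nodes signals a strong (possibly indirect) relationship. A minimal sketch; the iteration count, relaxation factor, and max-norm aggregation are assumptions, not the paper's exact settings.

```python
import random

def algebraic_distances(adj, num_vectors=5, iterations=20, omega=0.5, seed=0):
    """Pairwise algebraic distances on an undirected graph (a sketch).

    adj: dict mapping node -> list of neighbor nodes.
    Returns a dict mapping ordered pairs (u, v) -> max-norm distance
    over `num_vectors` independently smoothed random test vectors.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    # Several random starting vectors reduce the chance of accidental ties.
    vals = [{n: rng.random() for n in nodes} for _ in range(num_vectors)]
    for _ in range(iterations):
        for k in range(num_vectors):
            new = {}
            for n in nodes:
                nbrs = adj[n]
                avg = (sum(vals[k][m] for m in nbrs) / len(nbrs)
                       if nbrs else vals[k][n])
                # Jacobi over-relaxation: blend old value with neighbor mean.
                new[n] = (1 - omega) * vals[k][n] + omega * avg
            vals[k] = new
    dist = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dist[(u, v)] = max(abs(vals[k][u] - vals[k][v])
                               for k in range(num_vectors))
    return dist
```

On two triangles joined by a single bridge edge, intra-triangle distances come out smaller than cross-bridge distances, which is exactly the signal used to reinforce neighborhood similarity.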
PLOS ONE, Jul 6, 2021
Biomedical research papers often combine disjoint concepts in novel ways, such as when describing a newly discovered relationship between an understudied gene and an important disease. These concepts are often explicitly encoded as metadata keywords, such as the author-provided terms included with many documents in the MEDLINE database. While substantial recent work has addressed the problem of text generation in a more general context, applications such as scientific writing assistants or hypothesis generation systems could benefit from the capacity to select the specific set of concepts that underpin a generated biomedical text. We propose a conditional language model following the transformer architecture. This model uses the "encoder stack" to encode concepts that a user wishes to discuss in the generated text. The "decoder stack" then follows the masked self-attention pattern to perform text generation, using both prior tokens as well as the encoded condition. We demonstrate that this approach provides significant control, while still producing reasonable biomedical text.
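The masked ("causal") self-attention pattern that the decoder stack follows can be sketched with plain numpy: logits for future positions are set to negative infinity before the softmax, so generation at step t conditions only on steps up to t. This single-head sketch omits the learned projections, multiple heads, and the cross-attention to the encoded keywords that the full model uses.

```python
import numpy as np

def masked_self_attention(x, mask_future=True):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (seq_len, d) array of token representations.
    Returns (output, attention_weights).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # (seq, seq) attention logits
    if mask_future:
        # Each position may attend only to itself and earlier positions.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (max-subtraction for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights
```

Because of the mask, the first output token is a function of the first input alone, which is what makes left-to-right sampling consistent with training.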
Text preprocessing is an essential step in text mining. Removing words that can negatively impact the quality of prediction algorithms or are not informative enough is a crucial storage-saving technique in text indexing, and results in improved computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, many common words differ from one domain to another while having no significance within a particular domain. Eliminating domain-specific common words in a corpus reduces the dimensionality of the feature space and improves the performance of text mining tasks. In this paper, we present a novel mathematical approach for the automatic extraction of domain-specific words, called the hyperplane-based approach. This new approach depends on a low-dimensional representation of each word in vector space and its distance from a hyperplane. The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features. We compare the hyperplane-based approach with other feature selection methods, namely χ2 and mutual information. An experimental study is performed on three different datasets and five classification algorithms, measuring the dimensionality reduction and the increase in classification performance. Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information. The computational time required to identify the domain-specific words is also significantly lower than that of mutual information. Reproducibility: code and results can be found at https://github.com/FarahAlshanik/Domain-Specific-Word-List
Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied to abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available to the broad scientific community.
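The path-centered retrieval step can be sketched with a plain BFS: find a shortest path between two concept nodes, then gather the paper nodes adjacent to that path; their abstracts would feed the topic model. MOLIERE's network is weighted and far larger, so this is only an illustrative sketch with hypothetical node names.

```python
from collections import deque

def bfs_shortest_path(adj, src, dst):
    """Shortest path in an unweighted graph via breadth-first search.

    adj: dict mapping node -> list of neighbor nodes.
    Returns the path as a list of nodes, or None if unreachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in adj.get(node, ()):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

def abstracts_near_path(adj, path, paper_nodes):
    """Collect paper nodes adjacent to any node on the path; in MOLIERE
    these abstracts would then be modeled with LDA."""
    nearby = set()
    for node in path:
        nearby.update(n for n in adj.get(node, ()) if n in paper_nodes)
    return nearby
```

A weighted shortest-path variant (e.g. Dijkstra) would replace the BFS on the real network, where edge weights encode relationship strength.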
arXiv (Cornell University), Jun 5, 2023
This project demonstrates how medical corpus hypothesis generation, a knowledge discovery field of AI, can be used to derive new research angles for landscape and urban planners. The hypothesis generation approach herein combines deep learning with topic modeling, a probabilistic approach to natural language analysis that scans aggregated research databases for words that can be grouped together based on their subject matter commonalities; the word groups accordingly form topics that can provide implicit connections between two general research terms. The hypothesis generation system AGATHA was used to identify likely conceptual relationships between emerging infectious diseases (EIDs) and deforestation, with the objective of providing landscape planners guidelines for productive research directions to help them formulate research hypotheses centered on deforestation and EIDs that will contribute to the broader health field that asserts causal roles of landscape-level issues. This research also serves as a partial proof-of-concept for the application of medical database hypothesis generation to medicine-adjacent hypothesis discovery.
arXiv (Cornell University), Apr 13, 2018
The potential for automatic hypothesis generation (HG) systems to improve research productivity keeps pace with the growing set of publicly available scientific information. But as data becomes easier to acquire, we must understand the effect different textual data sources have on our resulting hypotheses. Are abstracts enough for HG, or does it need full-text papers? How many papers does an HG system need to make valuable predictions? How sensitive is a general-purpose HG system to hyperparameter values or input quality? What effect do corpus size and document length have on HG results? To answer these questions we train multiple versions of the knowledge-network-based HG system MOLIERE on varying corpora in order to compare challenges and trade-offs in terms of result quality and computational requirements. MOLIERE generalizes the main principles of similar knowledge-network-based HG systems and reinforces them with topic modeling components. The corpora include the abstract and full-text versions of PubMed Central, as well as iterative halves of MEDLINE, which allows us to compare the effect document length and count have on the results. We find that, quantitatively, corpora with a higher median document length yield marginally higher-quality results, yet take substantially longer to process. However, qualitatively, full-length papers introduce a significant number of intruder terms into the resulting topics, which decreases human interpretability. Additionally, we find that the effect of document length is greater than that of document count, even when both sets contain only paper abstracts.
arXiv (Cornell University), Sep 9, 2019
arXiv (Cornell University), Mar 18, 2020
Graph representation learning based on graph neural networks (GNNs) can greatly improve the performance of downstream tasks, such as node and graph classification. However, general GNN models do not aggregate node information in a hierarchical manner, and can miss key higher-order structural features of many graphs. Hierarchical aggregation also makes the learned graph representations explainable. In addition, supervised graph representation learning requires labeled data, which is expensive and error-prone. To address these issues, we present an unsupervised graph representation learning method, Unsupervised Hierarchical Graph Representation (UHGR), which can generate hierarchical representations of graphs. Our method focuses on maximizing mutual information between "local" and high-level "global" representations, which enables us to learn node embeddings and graph embeddings without any labeled data. To demonstrate the effectiveness of the proposed method, we perform node and graph classification using the learned node and graph embeddings. The results show that the proposed method achieves comparable results to state-of-the-art supervised methods on several benchmarks. In addition, our visualization of hierarchical representations indicates that our method can capture meaningful and interpretable clusters. Reproducibility: our code and experimental data are available online.
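The "local" versus "global" mutual-information objective can be sketched as a contrastive binary cross-entropy, in the style of Deep Graph Infomax: a bilinear discriminator should score node-level features from the real graph high against the graph's global summary vector, and features from a corrupted graph low. The function name and the fixed bilinear weight matrix are illustrative assumptions; in training, W and the encoders would be learned jointly.

```python
import numpy as np

def contrastive_mi_loss(locals_pos, locals_neg, global_vec, W):
    """Contrastive MI-maximization objective (illustrative sketch).

    locals_pos: local (node-level) features from the real graph
    locals_neg: local features from a corrupted (shuffled) graph
    global_vec: global graph summary vector
    W: bilinear discriminator weight matrix
    Returns the binary cross-entropy: low when the discriminator
    separates real from corrupted local features.
    """
    def score(h):
        # Bilinear discriminator squashed to a probability.
        return 1.0 / (1.0 + np.exp(-(h @ W @ global_vec)))
    pos = np.mean([np.log(score(h) + 1e-12) for h in locals_pos])
    neg = np.mean([np.log(1.0 - score(h) + 1e-12) for h in locals_neg])
    return -(pos + neg) / 2.0
```

Minimizing this loss pushes real local features to align with the global summary, which is the sense in which their mutual information is maximized.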
arXiv (Cornell University), Apr 24, 2017
Since the Agile Manifesto, many organizations have explored agile development methods to replace traditional waterfall development. Interestingly, waterfall remains the most widely used practice, suggesting that there is something missing from the many "flavors" of agile methodologies. To investigate this, we examine seven of the most common agile methodologies, and evaluate each against a series of criteria centered on product quality and adherence to agile principles. We find that no methodology entirely replaces waterfall, and summarize the strengths and weaknesses of each. From this, we conclude that agile methods are, as a whole, unable to cope with the realities of technical debt and large-scale systems. Ultimately, no one methodology fits all projects.