Graph-Powered Machine Learning
About this ebook
Summary
In Graph-Powered Machine Learning, you will learn:
The lifecycle of a machine learning project
Graphs in big data platforms
Data source modeling using graphs
Graph-based natural language processing, recommendations, and fraud detection techniques
Graph algorithms
Working with Neo4J
Graph-Powered Machine Learning teaches you to use graph-based algorithms and data organization strategies to develop superior machine learning applications. You’ll dive into the role of graphs in machine learning and big data platforms, and take an in-depth look at data source modeling, algorithm design, recommendations, and fraud detection. Explore end-to-end projects that illustrate architectures and help you optimize with best design practices. Author Alessandro Negro’s extensive experience shines through in every chapter, as you learn from examples and concrete scenarios based on his work with real clients!
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
Identifying relationships is the foundation of machine learning. By recognizing and analyzing the connections in your data, graph-centric algorithms like K-nearest neighbor or PageRank radically improve the effectiveness of ML applications. Graph-based machine learning techniques offer a powerful new perspective for machine learning in social networking, fraud detection, natural language processing, and recommendation systems.
About the book
Graph-Powered Machine Learning teaches you how to exploit the natural relationships in structured and unstructured datasets using graph-oriented machine learning algorithms and tools. In this authoritative book, you’ll master the architectures and design practices of graphs, and avoid common pitfalls. Author Alessandro Negro explores examples from real-world applications that connect GraphML concepts to practical tasks.
What's inside
Graphs in big data platforms
Recommendations, natural language processing, fraud detection
Graph algorithms
Working with the Neo4J graph database
About the reader
For readers comfortable with machine learning basics.
About the author
Alessandro Negro is Chief Scientist at GraphAware. He has been a speaker at many conferences, and holds a PhD in Computer Science.
Table of Contents
PART 1 INTRODUCTION
1 Machine learning and graphs: An introduction
2 Graph data engineering
3 Graphs in machine learning applications
PART 2 RECOMMENDATIONS
4 Content-based recommendations
5 Collaborative filtering
6 Session-based recommendations
7 Context-aware and hybrid recommendations
PART 3 FIGHTING FRAUD
8 Basic approaches to graph-powered fraud detection
9 Proximity-based algorithms
10 Social network analysis against fraud
PART 4 TAMING TEXT WITH GRAPHS
11 Graph-based natural language processing
12 Knowledge graphs
Graph-Powered Machine Learning - Alessandro Negro
Part 1 Introduction
We are surrounded by graphs. Facebook, LinkedIn, and Twitter are the most famous examples of social networks—that is, graphs of people. Other types of graphs exist even though we don’t think of them as such: electrical power grids, the London Tube, and so on.
Graphs are powerful structures useful not only for representing connected information, but also for supporting multiple types of analysis. Their simple data model, consisting of just two basic concepts—nodes and relationships—is flexible enough to store complex information. By also storing properties on nodes and relationships, it is possible to represent practically anything, at any scale.
Furthermore, in a graph every node and every relationship is an access point for analysis, and from any access point it is possible to navigate the rest of the graph, which provides multiple access patterns and analytical possibilities.
Machine learning, on the other hand, provides tools and techniques for building representations of reality and making predictions. Recommendation is a good example: the algorithm takes what users have interacted with and predicts what they will be interested in. Fraud detection is another: from previous transactions (legitimate or not), it builds a model that can recognize, with good approximation, whether a new transaction is fraudulent.
The performance of machine learning algorithms, both in terms of accuracy and speed, is affected directly by the way in which we represent our training data and store our prediction model. The quality of an algorithm’s predictions is only as good as the quality of the training dataset. Data cleansing and feature selection, among other tasks, are mandatory if we want to achieve a reasonable level of trust in the predictions. The speed at which the system delivers predictions affects the usability of the entire product. Suppose that a recommendation algorithm for an online retailer produced recommendations in 3 minutes: by that time, the user would be on another page or, worse, on a competitor’s website.
Graphs can support machine learning by doing what they do best: representing data in a way that is easily understandable and easily accessible. Graphs make all the necessary processes faster, more accurate, and much more effective. Moreover, graph algorithms are powerful tools for machine learning practitioners. Graph community detection algorithms can help identify groups of people, PageRank can reveal the most relevant keywords in a text, and so on.
If you didn’t fully understand some of the terms and concepts presented in the introduction, the first part of the book will provide you with all the knowledge you need to move forward. It introduces the basic concepts of graphs and machine learning, both as independent subjects and as a powerful combination. Let me wish you good reading!
1 Machine learning and graphs: An introduction
This chapter covers
An introduction to machine learning
An introduction to graphs
The role of graphs in machine learning applications
Machine learning is a core branch of artificial intelligence: it is the field of study in computer science that allows computer programs to learn from data. The term was coined in 1959, when Arthur Samuel, an IBM computer scientist, wrote the first computer program to play checkers [Samuel, 1959]. He had a clear idea in mind:
Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.
Samuel wrote his initial program by assigning a score to each board position based on a fixed formula. This program worked quite well, but in a second approach, he had the program execute thousands of games against itself and used the results to refine the board scoring. Eventually, the program reached the proficiency of a human player, and machine learning took its first steps.
An entity—such as a person, an animal, an algorithm, or a generic computer agent¹—is learning if, after making observations about the world, it is able to improve its performance on future tasks. In other words, learning is the process of converting experience to expertise or knowledge [Shalev-Shwartz and Ben-David, 2014]. Learning algorithms use training data that represents experience as input and create expertise as output. That output can be a computer program, a complex predictive model, or tuning of internal variables. The definition of performance depends on the specific algorithm or goal to be achieved; in general, we consider it to be the extent to which the prediction matches specific needs.
Let’s describe the learning process with an example. Consider the implementation of a spam filter for emails. A pure programming solution would be to write a program to memorize all the emails labeled as spam by a human user. When a new email arrives, the pseudoagent will search for a similar match in the previous spam emails, and if it finds any matches, the new email will be rerouted to the trash folder. Otherwise, the email will pass through the filter untouched.
This approach could work and, in some scenarios, be useful. Yet it is not a learning process because it lacks an important aspect of learning: the ability to generalize, to transform the individual examples into a broader model. In this specific use case, it means the ability to label unseen emails even though they are dissimilar to previously labeled emails. This process is also referred to as inductive reasoning or inductive inference.² To generalize, the algorithm should scan the training data and extract a set of words whose appearance in an email message is indicative of spam. Then, for a new email, the agent would check whether one or more of the suspicious words appear and predict its label accordingly.
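To make the idea concrete, here is a minimal sketch, in Python, of such a generalizing filter. The word-extraction rule, the threshold, and the toy emails are illustrative assumptions, not code from the book.

def extract_suspicious_words(labeled_emails):
    """Collect words that appear in spam emails but never in legitimate ones."""
    spam_words, ham_words = set(), set()
    for text, is_spam in labeled_emails:
        words = set(text.lower().split())
        (spam_words if is_spam else ham_words).update(words)
    return spam_words - ham_words

def predict_is_spam(email_text, suspicious_words, threshold=2):
    """Label an unseen email as spam if it contains enough suspicious words."""
    hits = sum(1 for word in email_text.lower().split() if word in suspicious_words)
    return hits >= threshold

training = [("win a free prize now", True), ("meeting notes attached", False)]
vocab = extract_suspicious_words(training)
print(predict_is_spam("claim your free prize today", vocab))  # True: flags an unseen email

Unlike the memorization approach, this filter can label an email it has never seen, simply because the email shares suspicious words with past spam.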
If you are an experienced developer, you might be wondering, Why should I write a program that learns how to program itself, when I can instruct the computer to carry out the task at hand?
Taking the example of the spam filter, it is possible to write a program that checks for the occurrence of some words and classifies an email as spam if those words are present. But this approach has three primary disadvantages:
A developer cannot anticipate all possible situations. In the spam-filter use case, all the words that might be used in a spam email cannot be predicted up front.
A developer cannot anticipate all changes over time. In spam emails, new words can be used, or techniques can be adopted to avoid easy recognition, such as adding hyphens or spaces between characters.
Sometimes, a developer cannot write a program to accomplish the task. Even though recognizing the face of a friend is a simple task for a human, for example, it is impossible to program software to accomplish this task without the use of machine learning.
Therefore, when you face new problems or tasks that you would like to solve with a computer program, the following questions can help you decide whether to use machine learning:
Is the specific task too complex to be programmed?
Does the task require any sort of adaptivity throughout its life?
A crucial aspect of any machine learning task is the training data on which the knowledge is built. Starting from the wrong data leads to the wrong results, regardless of the potential performance or the quality of the learning algorithm used.
The aim of this book is to help data scientists and data engineers approach the machine learning process from two sides: the learning algorithm and the data. In both perspectives, we will use the graph (let me introduce it now as a set of nodes and relationships connecting them) as a valuable mental and technological model. A lot of learning algorithms based on data represented as graphs can deliver efficient predictive models, and other algorithms can be improved by using either data represented as graphs or graph algorithms in the workflow. The use of graphs also provides many other benefits: graphs are a valuable storage model for representing the knowledge that feeds the process, managing the training data, and storing the output of the predictive model, while providing multiple ways to access it quickly. This book walks the reader through the entire machine learning project life cycle, showing step by step all cases in which graphs could be valuable and reliable friends.
But graphs are not a panacea for all machine learning projects. In stream analytics, in which it is necessary to process a stream of data to reveal short-term anomalies, storing data in the form of a graph could be useless. Furthermore, other algorithms require data in a format that cannot fit in a graph, either during training or for model storage and access. This book gives the reader the capability to discern whether using a graph in the process would be an advantage or a burden.
1.1 Machine learning project life cycle
A machine learning project is a human process as well as a software project. It involves a large number of people, a lot of communication, a great deal of work, and a mixed set of skills, and it requires a well-defined approach to be effective. We’ll start our long journey by defining a workflow with clear steps and components that will be used throughout the book. The mental schema proposed here, which is one of many possible schemas, will help you better understand the role of graphs in the development and deployment of a successful machine learning project.
Delivering machine learning solutions is a complex process that requires more than selecting the right algorithm(s). Such projects include numerous tasks related to [Sculley, 2015]
Selecting the data sources
Gathering data
Understanding the data
Cleaning and transforming the data
Processing the data to create ML models
Evaluating the results
Deploying
After deployment, it is necessary to monitor the application and tune it. The entire process involves multiple tools, a lot of data, and different people.
One of the most commonly used processes for data mining projects is the Cross-Industry Standard Process for Data Mining, or CRISP-DM [Wirth and Hipp, 2000]. Although the CRISP-DM model was designed for data mining, it can also be applied to generic machine learning projects. Key features of the CRISP-DM that make it attractive as part of the base workflow model are
It is not proprietary.
It is application, industry, and tool neutral.
It explicitly views the data analytics process from both an application-focused and a technical perspective.
This method can be used for project planning and management, for communicating, and for documentation purposes.
The CRISP-DM reference model offers an overview of the life cycle of a machine learning project. This schema or mental model helps in approaching the machine learning project from a data perspective before taking an algorithmic point of view and provides the baseline for the definition of a clear workflow. Figure 1.1 shows the six phases of the process. It is worth noting that data is at the core of this process.
Looking at figure 1.1, we see that the sequence of the phases is fluid. The arrows indicate only the most important and frequent dependencies between phases; in a particular project, the outcome of each phase determines which phase, or which particular task of a phase, has to be performed next.
Figure 1.1 The six phases of the CRISP-DM process
The outer circle symbolizes the cyclic nature of the process, which is not finished when a solution is deployed. Subsequent machine learning processes can benefit not only from the experiences of previous ones (the virtuous cycle of Linoff and Berry [2011]), but also from the outcomes of the previous processes. Let’s outline each phase in more detail.
1.1.1 Business understanding
The first phase requires defining the goals of the machine learning project. These objectives are often expressed in general terms: increase revenue, improve the customer experience, get better and customized search results, sell more products, and so on. To convert these high-level problem definitions to concrete requirements and constraints for the machine learning project, it is necessary to understand the business and the domain.
Machine learning projects are software projects, and during this phase, it is also important to learn the language and domain concepts. This knowledge will not only help with communication between the data scientist and the internal team during subsequent phases, but also improve the quality of the documentation and the presentation of the results.
The outcomes of this phase are
A clear understanding of the domain and the business perspective
A definition of the goals, requirements, and constraints
The conversion of this knowledge to a machine learning problem definition
A preliminary and reasonable project plan designed to achieve the objectives
The goals of the first iteration shouldn’t be too broad, because this round requires a lot of effort related to the injection of the machine learning process into the existing infrastructure. At the same time, it is important to keep future extensions in mind while designing the first iteration.
1.1.2 Data understanding
The data-understanding phase starts by inquiring about the data sources and collecting some data from each of them, and proceeds with these activities:
Get familiar with the data.
Identify data quality problems.
Get first insights into the data.
Detect interesting subsets to form hypotheses about hidden information.
Data understanding requires domain and business understanding. Moreover, looking at the data helps build understanding of the domain and the business perspective, which is why there is a feedback loop between this phase and the previous one.
The outcomes of this phase are
A clear understanding of the data sources available
A clear understanding of the different kinds of data and their content (or at least of all the significant parts of the machine learning goals)
An architecture design to get or extract this data and feed it into the next steps of the machine learning workflow
1.1.3 Data preparation
This phase covers all the activities to gather data from multiple sources and organize it in the specific kind of structure required by the algorithm in the modeling phase. Data preparation tasks include record and attribute selection, feature engineering, data merging, data cleaning, construction of new attributes, and enrichment of existing data. As pointed out before, the quality of the data has an enormous impact on the final results of the next phases, so this phase is crucial.
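As a rough illustration of these tasks, the following sketch merges two toy sources, fills missing values, and constructs a new attribute using pandas. The column names and data are assumptions made only for the example.

import pandas as pd

# Two toy sources to be merged into a single training dataset
orders = pd.DataFrame({"user_id": [1, 2, 2], "amount": [30.0, None, 12.5]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["IT", "UK"]})

dataset = orders.merge(users, on="user_id", how="left")                   # merge sources
dataset["amount"] = dataset["amount"].fillna(dataset["amount"].median())  # clean missing values
dataset["is_high_value"] = dataset["amount"] > 20                         # construct a new attribute
print(dataset)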
The outcomes of this phase are
One or more data structure definitions, using adequate design techniques
A well-defined data pipeline for feeding the machine learning algorithm training data
A set of procedures for merging, cleaning, and enriching data
Another outcome of this phase is the identification of the database management system where this data will be stored while waiting to be processed.
For the sake of completeness, it is not always required to have an explicit data store for persisting data before further processing. It is possible to extract data and convert it before the processing phase. Having such an intermediate step, however, has a lot of advantages related to performance and data quality as well as to further extensibility.
1.1.4 Modeling
The modeling phase is where machine learning occurs. Different algorithms are selected and applied, and their parameters are calibrated to optimal values. The algorithms are used to build a range of prediction models from which the best is selected for deployment when the evaluation phase is complete. An interesting aspect is that some algorithms produce a prediction model, but others do not.³
The outcomes of this phase are
The set of algorithms to be tested in the next phase
The related predictive models (where applicable)
There is a close link between data preparation and modeling, because you often identify problems with the data and get ideas for constructing new data points while modeling. Moreover, some techniques require specific data formats.
1.1.5 Evaluation
At this stage in a machine learning project, you have built one or more predictive models that appear to be of high quality. Before a model can be deployed, it is important to evaluate it thoroughly and review the steps executed to construct the model, so that you can be certain it properly achieves the business objectives defined at the beginning of the process.
This evaluation is conducted in a formal way, such as splitting the available data into a training set (80%) and a testing set (20%). Another primary objective is to determine whether any important business issues have not been sufficiently considered.
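As a minimal illustration of the 80/20 split mentioned above, the sketch below uses scikit-learn’s train_test_split on toy arrays; the data itself is invented for the example.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix: 10 samples, 2 features
y = np.array([0, 1] * 5)          # toy labels

# Hold out 20% of the samples for evaluation, train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2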
The outcomes of this phase are
A set of values that allows measurement of performance. (The specific measure of success depends on the algorithm type and the scope.)
A thorough evaluation of whether the business goals are achieved.
Authorization for using the solution in a production environment.
1.1.6 Deployment
Because machine learning models are built to serve some purpose in an organization, the creation of a model is generally not the end of the project. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as releasing a complete infrastructure that delivers a service to end users. In many cases, the customer—not the data scientist—carries out the deployment steps. In any event, it is important to understand up front what actions will need to be carried out to make use of the created models.
The outcomes of this phase are as follows:
One or multiple reports with the results of the prediction models
The predictive models themselves, used for predicting the future and supporting decisions
An infrastructure to provide a specific set of services to the end users
When the project is in production, it is necessary to monitor it constantly (evaluating performance, for example).
1.2 Machine learning challenges
Machine learning projects have some intrinsic challenges that make them complex to accomplish. This section summarizes the main aspects you need to take into account when approaching a new machine learning project.
1.2.1 The source of truth
The CRISP-DM model puts data at the center of the machine learning process by describing the life cycle from a data perspective. The training data represents the source of truth from which any insight can be extracted and any prediction can be made. Managing the training data requires a lot of effort. To quote Jeffrey Heer, professor of computer science at the University of Washington, “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
As a case in point, it has been estimated that data scientists spend up to 80% of their time on data preparation [Lohr, 2014].
I often use the following statement to shift the focus to data before discussing the details of the algorithms:
Even the best learning algorithm on the wrong data produces the wrong results.
The seminal papers by Banko and Brill [2001] and Halevy, Norvig, and Pereira [2009] point out how, for complex problems, data often matters more than algorithms. Both articles consider natural language processing, but the concept can be generalized to machine learning in general [Sculley, 2015].
Figure 1.2, from Banko and Brill [2001], shows the learning curves for a set of learners, considering the average performance of each on different sizes of training data (up to 1 billion words). The specific algorithms used are not important here; the key takeaway is that, as is clear from the image, increasing the amount of data available during the training phase improved the performance of all the learners. These results suggest that it’s worth reconsidering spending time and money on corpus development versus spending it on algorithm development. From another perspective, as a data scientist you can focus on the vertical dimension—finding a better algorithm—but the graph shows that there is more room for improvement in the horizontal direction—gathering more data. As proof, the worst-performing algorithm in figure 1.2 performs much better with 10 million elements than the best one with 1 million elements.
Figure 1.2 Learning curves for confusion set disambiguation
Collecting data from multiple sources allows you not only to access a huge set of data, but also to improve the quality of the data, solving problems such as sparsity, misspellings, and incorrect values. Gathering data from a variety of sources is not an issue; we live in the big data era, and an abundance of digital data is available from the web, sensors, smartphones, corporate databases, and open data sources. But although there is value in combining different datasets, there are problems too. Data from different sources comes in different formats. Before the learner can analyze it, the data must be cleaned up, merged, and normalized into a unified and homogeneous schema that the algorithm can understand. Furthermore, obtaining additional training data has a nonzero cost for many problems, and for supervised learning, this cost may be high.
For these reasons, the data represents the first big challenge in the machine learning process. Data concerns can be summarized in four categories:
Insufficient quantity of data—Machine learning requires a lot of training data to work properly. Even for simple use cases, it needs thousands of examples, and for complex problems such as deep learning or for nonlinear algorithms, you may need millions of examples.
Poor quality of data—Data sources are always full of errors, outliers, and noise. Poor data quality directly affects the quality of the results of the machine learning process because it is hard for many algorithms to discard wrong (incorrect, unrelated, or irrelevant) values and to then detect underlying patterns in this mess.
Nonrepresentative data—Machine learning is a process of induction: the model makes inferences from what it has observed and will likely not support edge cases that your training data does not include. Furthermore, if the training data is too noisy or is related only to a subset of possible cases, the learner might generate bias or overfit the training data and will not be able to generalize to all the possible cases. This is true for both instance-based and model-based learning algorithms.
Irrelevant features—The algorithm will learn in the right way if the data contains a good set of relevant features and not too many irrelevant features. Although it is often a useful strategy to select more features, with the goal of increasing the accuracy of the model, more is not always better. Using more features will enable the learner to find a more detailed mapping from feature to target, which increases the risk that the model computed will overfit the data. Feature selection and feature extraction represent two important tasks during the preparation of data.
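As a small, hypothetical illustration of feature selection, the following sketch keeps only the k features most related to the target, using scikit-learn; the random data and the choice of k are assumptions made for the example.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # 20 candidate features
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # the target depends only on features 0 and 3

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_reduced = selector.transform(X)          # keep only the 2 most informative columns
print(selector.get_support(indices=True))  # most likely features 0 and 3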
To overcome these issues, data scientists have to gather and merge data from multiple sources, clean it, and enrich it by using external sources. (Moreover, it often happens that data is prepared for a certain purpose, but along the way, you discover something new, and the desired purpose changes.) These tasks are not simple ones; they require not only a significant amount of professional skill, but also a data management platform that allows the changes to be performed in a convenient way.
The issues related to the quality of the training examples determine a set of data management constraints and requirements for the infrastructure of the machine learning project. The problems can be summarized as follows:
Managing big data—Gathering data from multiple data sources and merging it into a unified source of truth will generate a huge dataset, and as pointed out before, increasing the amount of (quality) data will improve the quality of the learning process. Chapter 2 considers the characteristics of a big data platform and shows how graphs can play a prominent role in taming such a beast.
Designing a flexible schema—Try to create a schema model that provides the capabilities to merge multiple heterogeneous schemas into a unified and homogeneous data structure that satisfies the informational and navigational requirements. The schema should evolve easily according to changes in the purpose of the machine learning project. Chapter 4 introduces multiple data model schemas and best practices to model the data for several scenarios.
Developing efficient access patterns—Fast data reads boost the performance, in terms of processing time, of the training process. Tasks such as feature extraction, filtering, cleaning, merging, and other preprocessing of the training data will benefit from the use of a data platform that provides multiple and flexible access patterns.
1.2.2 Performance
Performance is a complex topic in machine learning because it can be related to multiple factors:
Predictive accuracy, which can be evaluated by using different performance measures. A typical performance measure for regression problems is the root mean squared error (RMSE), which measures the standard deviation⁴ of the errors the system makes in its predictions. In other words, it takes the differences between the estimated values and the known values for all the samples in the test dataset, squares them, averages them, and takes the square root (a minimal sketch follows this list). I present other techniques for measuring performance later in the book, when discussing the different algorithms. Accuracy depends on several factors, such as the amount of data available for training the model, the quality of the data, and the algorithm selected. As discussed in section 1.2.1, the data plays a primary role in guaranteeing a proper level of accuracy.
Training performance, which refers to the time required to compute the model. The amount of data to be processed and the type of algorithm used determine the processing time and the storage required for computing the prediction model. Clearly, this problem mainly affects the algorithms that produce a model as a result of the training phase. For instance-based learners,⁵ performance problems will appear later in the process, such as during prediction. In batch learning, the training time is generally longer, due to the amount of data to be processed (compared with the online learning approach, in which the algorithm learns incrementally from a smaller amount of data). Although in online learning the amount of data to be processed is small, the speed at which it is crunched affects the capacity of the system to stay aligned with the latest data available, which directly affects the accuracy of the predictions.
Prediction performance, which refers to the response time required to provide predictions. The output of the machine learning project could be a static one-time report to help managers make strategic decisions or an online service for end users. In the first case, the time required to complete the prediction phase and compute the model is not a significant concern, so long as the work is completed in a reasonable time frame (that is, not years). In the second case, prediction speed does matter, because it affects the user experience and the efficacy of the prediction. Suppose that you are developing a recommendation engine that recommends products similar to what the user is currently viewing according to their interests. The user navigation speed is quite high, which implies the need for a significant number of predictions in a short time interval; only a few milliseconds are available to suggest something useful before the user proceeds to the next item. In this scenario, the speed of prediction is the key to success.
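Here is the minimal RMSE sketch referenced in the first item of the list above; the sample values are invented for illustration.

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3.0, 5.0, 2.0], [2.5, 5.5, 2.0]))  # ~0.41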
These factors can be translated into multiple requirements for machine learning projects, such as fast access to the data source during training, high data quality, an efficient access pattern for the model to accelerate predictions, and so on. In this context, graphs could provide the proper storage mechanism for both source and model data, reducing the access time required to read data as well as offering multiple algorithmic techniques for improving the accuracy of the predictions.
1.2.3 Storing the model
In the model-based learner approach, the output of the training phase is a model that will be used for making predictions. This model requires time to be computed and has to be stored in a persistence layer to avoid recomputation at every system restart.
The model’s structure is related directly to the specific algorithm or the algorithm class employed. Examples include
Item-to-item similarities for a recommendation engine that uses a nearest-neighbor approach
An item-cluster mapping that expresses how the elements are grouped in clusters
The sizes of the two models differ enormously. Consider a system that contains 100 items. As a first effort, item-to-item similarities would require 100 x 100 entries to be stored. Taking advantage of optimizations, this number can be reduced to consider only the top k similar items, in which case the model will require 100 x k entries. By contrast, item-cluster mapping requires only 100 entries; hence, the space required to store the model in memory or on disk could be huge or relatively modest. Moreover, as pointed out earlier, model access/query time affects global performance during the prediction phase. For these reasons, model storage management represents a significant challenge in machine learning.
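The following sketch contrasts the two model shapes just described; the item count, the value of k, and the random similarities are assumptions used only to show the difference in storage size.

import numpy as np

n_items, k = 100, 10
rng = np.random.default_rng(0)

# Nearest-neighbor style model: the top-k most similar items for every item
item_to_item = {i: rng.permutation(n_items)[:k].tolist() for i in range(n_items)}

# Clustering style model: a single cluster label per item
item_to_cluster = {i: int(i % 5) for i in range(n_items)}

print(sum(len(v) for v in item_to_item.values()))  # 1,000 stored entries (100 x k)
print(len(item_to_cluster))                        # 100 stored entries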
1.2.4 Real time
Machine learning is being used more and more frequently to deliver real-time services to users. Examples run the full spectrum from simple recommendation engines that react to the user’s last clicks all the way to self-driving cars that have been instructed not to injure a pedestrian crossing the street. Although the outcomes in case of failure are vastly different in these two examples, in both cases the capability of the learner to react quickly (or at least in an appropriately timely manner) to new stimuli coming from the environment is fundamental to the quality of the final results.
Consider a recommendation engine that provides real-time recommendations for anonymous users. This anonymity (the user is not registered or logged in) means that there is no long-term history of previous interactions—only short-term, session-based information provided by the use of cookies. This task is a complex one that involves multiple aspects and affects several phases of a machine learning project. The approaches taken could differ according to the learner(s) used, but the goals can be described as follows:
Learn fast. The online learner should be able to update the model as soon as new data is available. This capability will reduce the time gap between events or generic feedback, such as navigational clicks or interaction with a search session and updating of the model. The more the model is aligned to the latest events, the more it is able to meet the current needs of the user.
Predict fast. When the model is updated, the prediction should be fast—a maximum of a few milliseconds—because the user may navigate away from the current page or even change their opinion quite rapidly.
Both of these goals require algorithms that can align the model quickly, as well as a storage mechanism (in memory, on disk, or a combined version) that provides fast memorization and efficient access patterns.
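As a hedged illustration of the learn-fast, predict-fast goals, the sketch below updates an incremental scikit-learn classifier batch by batch. The streaming batches are random, invented data, and the specific learner is an assumption rather than the book’s choice.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()          # an incremental (online) learner
classes = np.array([0, 1])
rng = np.random.default_rng(0)

for _ in range(5):               # pretend each batch arrives as a stream of fresh events
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)  # learn fast: update the model on arrival

print(model.predict(rng.normal(size=(1, 4))))  # predict fast: a single, immediate prediction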
1.3 Graphs
As stated in the introduction to this chapter, graphs provide models and algorithms that can heavily support machine learning projects. Even though a graph is a simple concept, it is important to understand how to represent it and how to use the main concepts around it. This section introduces the key elements of the graph world. If you are already familiar with these concepts, you can skip this section.
1.3.1 What is a graph?
A graph is a simple and quite old mathematical concept: a data structure consisting of a set of vertices (or nodes/points) and edges (or relationships/lines) that can be used to model relationships among a collection of objects. Legend says that the lazy Leonhard Euler first started talking about graphs in 1736. While visiting Königsberg in Prussia (now Kaliningrad, Russia), Euler didn’t want to spend too much time walking in the city, which sat on both sides of the Pregel River and included two large islands that were connected to each other and to the two mainland portions of the city by seven bridges. Euler formalized the problem as planning a walk through the city that would cross each of those bridges once and only once. He proved that doing so was impossible, which led to the invention of graphs and graph theory [Euler, 1736]. So he stayed home instead. Figure 1.3 shows an old map of Königsberg and a representation of the graph that Euler used to prove his thesis.
Figure 1.3 The bridges in Königsberg that led to the invention of graph theory
More formally, a graph is a pair G = (V, E), where V is a collection of vertices, V = {Vi, i = 1, ..., n}, and E is a collection of edges over V, E = {Eij = (Vi, Vj), Vi ∊ V, Vj ∊ V}. E ⊆ [V]²; thus, the elements of E are two-element subsets of V [Diestel, 2017].
The simplest way to represent a graph is to draw a dot or a small circle for each vertex and then join two of those vertices by a line if they form an edge. Such a representation is shown in figure 1.4.
Figure 1.4 The undirected graph on V = {1, 2, 3, 4, 5} with edge set E = {(1,2), (1,5), (2,5), (2,4), (4,3)}
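As a quick illustration of this definition (using the Python networkx library for convenience, rather than the Neo4j tooling used later in the book), the following sketch builds the graph of figure 1.4 and queries adjacency.

import networkx as nx

# The undirected graph of figure 1.4
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 5), (2, 5), (2, 4), (4, 3)])

print(sorted(G.neighbors(2)))  # [1, 4, 5]: the vertices adjacent to 2
print(G.has_edge(1, 4))        # False: 1 and 4 are not adjacent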
Graphs can be directed or undirected, depending on whether a direction of traversal is defined on the edges. In directed graphs, an edge Eij can be traversed from Vi to Vj but not in the opposite direction; Vi is called the tail, or start node, and Vj is called the head, or end node. In undirected graphs, edge traversals in both directions are valid. Figure 1.4 represents an undirected graph, and figure 1.5 represents a directed graph.
Figure 1.5 The directed graph on V = {1, ..., 5} with edge set E = {(1,2), (2,5), (5,1), (2,4), (3,4)}
The arrow indicates the direction of the relationship. By default, edges in a graph are unweighted; thus, the corresponding graphs are said to be unweighted. When a weight—a numerical value used to convey some significance—is assigned to the edges, the graph is said to be weighted. Figure 1.6 shows the same graphs as figures 1.4 and 1.5 with a weight assigned to each edge.
Figure 1.6 An undirected weighted graph (a) and a directed weighted graph (b)
Two vertices x and y of G are defined as adjacent, or neighbors, if {x,y} is an edge of G. The edge Eij connecting them is said to be incident on the two vertices Vi and Vj. Two distinct edges e and f are adjacent if they have a vertex in common. If all the vertices of G are pairwise adjacent, G is complete. Figure 1.7 shows a complete graph in which each vertex is connected to all the other vertices.
Figure 1.7 A complete graph in which each vertex is connected to all the others
One of the most important properties of a vertex in a graph is its degree, defined as the total number of edges incident to that vertex, which is also equal to the number of neighbors of that vertex. In the undirected graph of figure 1.4, for example, the vertex 2 has degree 3 (it has the vertices 1, 4, and 5 as neighbors); the vertices 1 (neighbors are 2, 5), 4 (neighbors are 2, 3), and 5 (neighbors are 1, 2) have degree 2, whereas vertex 3 has degree 1 (is connected only with 4).
In a directed graph, the degree of a vertex Vi is split into the in-degree of the vertex, defined as the number of edges for which Vi is their end node (the head of the arrow) and the out-degree of the vertex, which is the number of edges for which Vi is their start node (the tail of the arrow). In the directed graph of figure 1.5, vertices 1 and 5 have an in-degree and out-degree of 1 (they each have two relationships, one ingoing and one outgoing), vertex 2 has an in-degree of 1 and an out-degree of 2 (one ingoing relationship from 1 and two outgoing to 4 and 5), vertex 4 has an in-degree of 2 and an out-degree of 0 (two ingoing relationships from 2 and 3), and vertex 3 has an out-degree of 1 and in-degree of 0 (one outgoing relationship to 4).
The average degree of a graph is computed as the sum of the degrees of all its vertices divided by the number of vertices:

average degree = (1/N) Σi deg(Vi) = 2E/N

where N is the number of vertices in the graph and E is the number of edges (each edge contributes to the degree of two vertices).
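The same quantities can be computed programmatically. The sketch below (again using networkx as an illustrative assumption) reproduces the in-degrees and out-degrees of figure 1.5 and the average degree of figure 1.4.

import networkx as nx

# The directed graph of figure 1.5
D = nx.DiGraph([(1, 2), (2, 5), (5, 1), (2, 4), (3, 4)])
print(D.in_degree(4), D.out_degree(4))  # 2 0
print(D.in_degree(2), D.out_degree(2))  # 1 2

# Average degree of the undirected graph of figure 1.4
G = nx.Graph([(1, 2), (1, 5), (2, 5), (2, 4), (4, 3)])
print(sum(dict(G.degree()).values()) / G.number_of_nodes())  # 2.0 = 2 * E / N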
A sequence of vertices with the property that each consecutive pair in the sequence is connected by an edge is called a path. A path with no repeating vertices is called a simple path. A cycle is a path in which the first and the last vertex coincide. In figure 1.4, [1,2,4], [1,2,4,3], [1,5,2,4,3], and so on are paths; in particular, the path of vertices [1,2,5] represents a cycle.
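The following sketch, under the same networkx assumption, enumerates the simple paths between vertices 1 and 3 of figure 1.4 and finds the cycle through vertices 1, 2, and 5.

import networkx as nx

G = nx.Graph([(1, 2), (1, 5), (2, 5), (2, 4), (4, 3)])  # the graph of figure 1.4

print(list(nx.all_simple_paths(G, source=1, target=3)))
# [[1, 2, 4, 3], [1, 5, 2, 4, 3]]: the simple paths from vertex 1 to vertex 3

print(nx.cycle_basis(G))
# one cycle, through vertices 1, 2, and 5 (vertex ordering may vary)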
1.3.2 Graphs as models of networks
Graphs are useful for representing how things are either physically or logically linked in simple or complex structures. A graph in which we assign names and meanings to the edges and vertices becomes what is known as a network. In these cases, a graph is the mathematical model for describing a network, whereas a network is a set of relations between objects, which could include people, organizations, nations, items found in a Google search, brain cells, or electrical transformers. This diversity illustrates the great power of graphs: a simple structure (which also means that they require little disk storage) that can be used to model⁶ complex systems.
Let’s explore this concept by using an example. Suppose that we have the graph shown in figure 1.8.
Figure 1.8 A nontrivial generic graph
This graph, pure in its mathematical definition, can be used to model several types of networks, depending on the semantics assigned to its edges and vertices:
A social network, if the vertices are people and each edge represents any sort of relationship between humans (friendship, family member, coworker)
An informational network, if the vertices are information structures such as web pages, documents, or papers and the edges represent logical connections such as hyperlinks, citations, or cross-references
A communication network, if the vertices are computers or other devices that can relay messages and the edges represent direct links along which messages can be transmitted
A transportation network, if the vertices are cities and the edges represent direct connections using flights or trains or roads
This small set of examples demonstrates how the same graph can represent multiple networks by assigning different semantics to edges and vertices. Figure 1.9 illustrates different types of networks.
Looking at figure 1.9, we can spot another interesting characteristic of graphs: they are highly communicative. Graphs are able to display information in a clear manner, which is why they are often used as information maps. By representing data as networks and using graph algorithms, it is possible to
Find complex patterns
Make them visible for further investigation and interpretation
Figure 1.9 Clockwise from top left: a co-occurrence network,⁷ ARPA network 1974,⁸ London Tube network,⁹ and electrical grid¹⁰
When the power of machine learning is combined with the power of the human brain, it enables efficient, advanced, and sophisticated data processing and pattern recognition. Networks are useful for displaying data by highlighting connections among elements. Newspapers and news websites are increasingly using networks, not only to help people navigate the data, but also as a powerful investigation tool. Recently (at the time of this writing), the Panama Papers¹¹ showcased the astonishing features of networks. The International Consortium of Investigative Journalists (ICIJ) analyzed the leaked financial documents to expose highly connected networks of offshore tax structures used by the world’s richest elites. The journalists extracted the entities (people, organizations, and any sort of intermediaries) and relationships (protector, beneficiary, shareholder, director, and so on) from the documents, stored them in a network, and analyzed them by using visual tools. The results looked like figure 1.10. Here, networks, graph algorithms, and graph visualization made evident something that otherwise would have been impossible to discover by using traditional data mining tools.
Figure 1.10 An example of the graph visualization for the Panama Papers
A lot of interesting examples in this direction are also available in blog posts by Valdis Krebs,¹² an organizational consultant who specializes in social network applications. His work contains examples of combining graph-powered machine learning with the human mind by way of graph visualization. Here, we consider one of his most famous examples.
The data in figure 1.11 was gathered from Amazon.com and represents its list of the top political books purchased in the United States in 2008 [Krebs, 2012]. Krebs applied network analysis principles to the data to create a map of books related to that year’s presidential election. Two books are linked if they were often purchased by the same customer. These books are known as also-bought pairs (a customer who bought this book also bought that book).
Figure 1.11 Network map of US political books in 2008 [Krebs, 2012]
There are three different and nonoverlapping clusters:
An Obama cluster of books in the top-left corner
A Democratic (blue) cluster in the middle
A Republican (red) cluster in the bottom-right corner
In 2008, the US political climate was highly polarized. This fact is mirrored in Amazon’s political-book data, with figure 1.11 showing a deep divide between conservative and liberal voters. There were no connections or intermediaries between red and blue books; each cluster was completely distinct from the others. As mentioned, there was a separate cluster of people reading biographies of presidential hopeful Barack Obama, but they were apparently not interested in reading or purchasing other political books.
Four years later, in 2012, the same analysis produced a network that appeared to be substantially different (figure 1.12). This network shows a lot of books that act as bridges between clusters. Moreover, potential voters appear to be reading books about both major candidates. The result is a more complex network that has no isolated clusters.
Figure 1.12 Network map of US political books in 2012 [Krebs, 2012]
The example of a network of political books introduces an important aspect of networks. If a graph is a pure mathematical concept that lives in its own Platonic world,¹³ networks, as abstractions of some concrete system or ecosystem, are subject to forces that act on them and change their structure. We refer to these forces as surrounding contexts: factors that exist outside the vertices and edges of a network but nonetheless affect how the network’s structure evolves over time. The nature of such contexts and the types of forces are specific to the kind of network. In social networks, for example, each individual has a distinctive set of personal characteristics, and similarities and compatibilities between two people’s characteristics influence link creation or deletion [Easley and Kleinberg, 2010].
One of the most basic notions governing the structure of social networks is homophily (from the Greek, meaning “love of the same”): links in a social network tend to connect people who are similar. More formally, if two people have characteristics that match in a proportion greater than expected in the population from which they are drawn or the network of which they are part, they are more likely to be connected [Verbrugge, 1977]. The converse is also true: if two people are connected, they are more likely to have common characteristics or attributes. For this reason, our friends on Facebook (for example) don’t look like a random sample of people, but are generally similar to us along ethnic, racial, and geographic dimensions; they tend to be similar to us in age, occupation, interests, beliefs, and opinions. This observation has a long history, with origins long before Mark Zuckerberg wrote his first line of code. The underlying idea can be found in the writings of Plato (“Similarity begets friendship”) and Aristotle (people love those who are like themselves), as well as in folk propositions such as “Birds of a feather flock together.” The homophily principle also applies to groups, organizations, countries, or any aspect of social units.
Understanding the surrounding contexts and the related forces that act on a network helps with machine learning tasks in multiple ways:
Networks are conduits for both wanted and unwanted flows. Marketers are always trying to reach and persuade people, and personal contact is most effective if one can find a way to start a snowball rolling. This concept is the basis of so-called viral marketing.
Understanding such forces allows the prediction of how the network will evolve over time, and enables data scientists to proactively react to such changes or use them for specific business purposes.
Findings in sociological and psychological disciplines point to the relevance of a person’s social network in determining their tastes, preferences, and activities. This information is useful for building recommendation engines. One of the problems related to recommendation engines is the cold-start problem: you can’t predict anything for a new user because you have no history for them. Social networks and the homophily principle can be used to make a recommendation based on the tastes of connected users.
1.4 The role of graphs in machine learning
Graphs are used to characterize interactions between objects of interest, to model simple and complex networks, or in general to represent real-world problems. Because they are based on a rigorous but straightforward formalism, they are used in many scientific fields, from computer science to historical sciences. We shouldn’t be surprised to see them being widely used in machine learning as a powerful tool that can enable intuition and power a lot of useful features. Graph-based machine learning is becoming more