Graph-Powered Machine Learning
About this ebook
Summary
In Graph-Powered Machine Learning, you will learn:
The lifecycle of a machine learning project
Graphs in big data platforms
Data source modeling using graphs
Graph-based natural language processing, recommendations, and fraud detection techniques
Graph algorithms
Working with Neo4J
Graph-Powered Machine Learning teaches you to use graph-based algorithms and data organization strategies to develop superior machine learning applications. You’ll dive into the role of graphs in machine learning and big data platforms, and take an in-depth look at data source modeling, algorithm design, recommendations, and fraud detection. Explore end-to-end projects that illustrate architectures and help you optimize with best design practices. Author Alessandro Negro’s extensive experience shines through in every chapter, as you learn from examples and concrete scenarios based on his work with real clients!
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
Identifying relationships is the foundation of machine learning. By recognizing and analyzing the connections in your data, graph-centric algorithms like K-nearest neighbor or PageRank radically improve the effectiveness of ML applications. Graph-based machine learning techniques offer a powerful new perspective for machine learning in social networking, fraud detection, natural language processing, and recommendation systems.
About the book
Graph-Powered Machine Learning teaches you how to exploit the natural relationships in structured and unstructured datasets using graph-oriented machine learning algorithms and tools. In this authoritative book, you’ll master the architectures and design practices of graphs, and avoid common pitfalls. Author Alessandro Negro explores examples from real-world applications that connect GraphML concepts to practical tasks.
What's inside
Graphs in big data platforms
Recommendations, natural language processing, fraud detection
Graph algorithms
Working with the Neo4J graph database
About the reader
For readers comfortable with machine learning basics.
About the author
Alessandro Negro is Chief Scientist at GraphAware. He has been a speaker at many conferences, and holds a PhD in Computer Science.
Table of Contents
PART 1 INTRODUCTION
1 Machine learning and graphs: An introduction
2 Graph data engineering
3 Graphs in machine learning applications
PART 2 RECOMMENDATIONS
4 Content-based recommendations
5 Collaborative filtering
6 Session-based recommendations
7 Context-aware and hybrid recommendations
PART 3 FIGHTING FRAUD
8 Basic approaches to graph-powered fraud detection
9 Proximity-based algorithms
10 Social network analysis against fraud
PART 4 TAMING TEXT WITH GRAPHS
11 Graph-based natural language processing
12 Knowledge graphs
Graph-Powered Machine Learning - Alessandro Negro
Part 1 Introduction
We are surrounded by graphs. Facebook, LinkedIn, and Twitter are the most famous examples of social networks—that is, graphs of people. Other types of graphs exist even though we don’t think of them as such: electrical power grids, the London Tube, and so on.
Graphs are powerful structures useful not only for representing connected information, but also for supporting multiple types of analysis. Their simple data model, consisting of just two basic concepts—nodes and relationships—is flexible enough to store complex information. By also storing properties on nodes and relationships, it is possible to represent practically anything, at any scale.
Furthermore, in a graph every node and every relationship is an access point for analysis, and from any access point it is possible to navigate the rest of the graph, which provides multiple access patterns and analytical possibilities.
Machine learning, on the other hand, provides tools and techniques for building representations of reality and making predictions. Recommendation is a good example: the algorithm takes what users have interacted with and predicts what they will be interested in. Fraud detection is another: from previous transactions (legitimate or not), it builds a model that can recognize, with good approximation, whether a new transaction is fraudulent.
The performance of machine learning algorithms, both in terms of accuracy and speed, is affected directly by the way in which we represent our training data and store our prediction model. The quality of an algorithm’s predictions is only as good as the quality of the training dataset. Data cleansing and feature selection, among other tasks, are mandatory if we want to achieve a reasonable level of trust in the predictions. The speed at which the system delivers predictions affects the usability of the entire product. Suppose that a recommendation algorithm for an online retailer produced recommendations in 3 minutes: by that time, the user would be on another page or, worse, on a competitor’s website.
Graphs can support machine learning by doing what they do best: representing data in a way that is easily understandable and easily accessible. Graphs make all the necessary processes faster, more accurate, and much more effective. Moreover, graph algorithms are powerful tools for machine learning practitioners. Graph community detection algorithms can help identify groups of people, PageRank can reveal the most relevant keywords in a text, and so on.
If you didn’t fully understand some of the terms and concepts presented in the introduction, the first part of the book will provide you with all the knowledge you need to move forward. It introduces the basic concepts of graphs and machine learning, both as independent subjects and as a powerful combination. Let me wish you good reading!
1 Machine learning and graphs: An introduction
This chapter covers
An introduction to machine learning
An introduction to graphs
The role of graphs in machine learning applications
Machine learning is a core branch of artificial intelligence: it is the field of study in computer science that allows computer programs to learn from data. The term was coined in 1959, when Arthur Samuel, an IBM computer scientist, wrote the first computer program to play checkers [Samuel, 1959]. He had a clear idea in mind:
Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.
Samuel wrote his initial program by assigning a score to each board position based on a fixed formula. This program worked quite well, but in a second approach, he had the program execute thousands of games against itself and used the results to refine the board scoring. Eventually, the program reached the proficiency of a human player, and machine learning took its first steps.
An entity—such as a person, an animal, an algorithm, or a generic computer agent¹—is learning if, after making observations about the world, it is able to improve its performance on future tasks. In other words, learning is the process of converting experience to expertise or knowledge [Shalev-Shwartz and Ben-David, 2014]. Learning algorithms use training data that represents experience as input and create expertise as output. That output can be a computer program, a complex predictive model, or tuning of internal variables. The definition of performance depends on the specific algorithm or goal to be achieved; in general, we consider it to be the extent to which the prediction matches specific needs.
Let’s describe the learning process with an example. Consider the implementation of a spam filter for emails. A pure programming solution would be to write a program to memorize all the emails labeled as spam by a human user. When a new email arrives, the pseudoagent will search for a similar match in the previous spam emails, and if it finds any matches, the new email will be rerouted to the trash folder. Otherwise, the email will pass through the filter untouched.
This approach could work and, in some scenarios, be useful. Yet it is not a learning process because it lacks an important aspect of learning: the ability to generalize, to transform the individual examples into a broader model. In this specific use case, it means the ability to label unseen emails even though they are dissimilar to previously labeled emails. This process is also referred to as inductive reasoning or inductive inference.² To generalize, the algorithm should scan the training data and extract a set of words whose appearance in an email message is indicative of spam. Then, for a new email, the agent would check whether one or more of the suspicious words appear and predict its label accordingly.
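To make the idea concrete, here is a minimal sketch, in Python, of such a generalizing filter. The word-extraction rule, the threshold, and the toy emails are illustrative assumptions, not code from the book.

def extract_suspicious_words(labeled_emails):
    """Collect words that appear in spam emails but never in legitimate ones."""
    spam_words, ham_words = set(), set()
    for text, is_spam in labeled_emails:
        words = set(text.lower().split())
        (spam_words if is_spam else ham_words).update(words)
    return spam_words - ham_words

def predict_is_spam(email_text, suspicious_words, threshold=2):
    """Label an unseen email as spam if it contains enough suspicious words."""
    hits = sum(1 for word in email_text.lower().split() if word in suspicious_words)
    return hits >= threshold

training = [("win a free prize now", True), ("meeting notes attached", False)]
vocab = extract_suspicious_words(training)
print(predict_is_spam("claim your free prize today", vocab))  # True: flags an unseen email

Unlike the memorization approach, this filter can label an email it has never seen, simply because the email shares suspicious words with past spam.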
If you are an experienced developer, you might be wondering, Why should I write a program that learns how to program itself, when I can instruct the computer to carry out the task at hand?
Taking the example of the spam filter, it is possible to write a program that checks for the occurrence of some words and classifies an email as spam if those words are present. But this approach has three primary disadvantages:
A developer cannot anticipate all possible situations. In the spam-filter use case, all the words that might be used in a spam email cannot be predicted up front.
A developer cannot anticipate all changes over time. In spam emails, new words can be used, or techniques can be adopted to avoid easy recognition, such as adding hyphens or spaces between characters.
Sometimes, a developer cannot write a program to accomplish the task. Even though recognizing the face of a friend is a simple task for a human, for example, it is impossible to program software to accomplish this task without the use of machine learning.
Therefore, when you face new problems or tasks that you would like to solve with a computer program, the following questions can help you decide whether to use machine learning:
Is the specific task too complex to be programmed?
Does the task require any sort of adaptivity throughout its life?
A crucial aspect of any machine learning task is the training data on which the knowledge is built. Starting from the wrong data leads to the wrong results, regardless of the potential performance or the quality of the learning algorithm used.
The aim of this book is to help data scientists and data engineers approach the machine learning process from two sides: the learning algorithm and the data. In both perspectives, we will use the graph (let me introduce it now as a set of nodes and relationships connecting them) as a valuable mental and technological model. A lot of learning algorithms based on data represented as graphs can deliver efficient predictive models, and other algorithms can be improved by using either data represented as graphs or graph algorithms in the workflow. The use of graphs also provides many other benefits: graphs are a valuable storage model for representing the knowledge that feeds the process, managing the training data, and storing the output of the predictive model, while providing multiple ways to access it quickly. This book walks the reader through the entire machine learning project life cycle, showing step by step all cases in which graphs could be valuable and reliable friends.
But graphs are not a panacea for all machine learning projects. In stream analytics, in which it is necessary to process a stream of data to reveal short-term anomalies, storing data in the form of a graph could be useless. Furthermore, other algorithms require data in a format that cannot fit in a graph, either during training or for model storage and access. This book gives the reader the capability to discern whether using a graph in the process would be an advantage or a burden.
1.1 Machine learning project life cycle
A machine learning project is a human process as well as a software project. It involves a large number of people, a lot of communication, a great deal of work, and a mixed set of skills, and it requires a well-defined approach to be effective. We’ll start our long journey by defining a workflow with clear steps and components that will be used throughout the book. The mental schema proposed here, which is one of many possible schemas, will help you better understand the role of graphs in the development and deployment of a successful machine learning project.
Delivering machine learning solutions is a complex process that requires more than selecting the right algorithm(s). Such projects include numerous tasks related to [Sculley, 2015]
Selecting the data sources
Gathering data
Understanding the data
Cleaning and transforming the data
Processing the data to create ML models
Evaluating the results
Deploying
After deployment, it is necessary to monitor the application and tune it. The entire process involves multiple tools, a lot of data, and different people.
One of the most commonly used processes for data mining projects is the Cross-Industry Standard Process for Data Mining, or CRISP-DM [Wirth and Hipp, 2000]. Although the CRISP-DM model was designed for data mining, it can also be applied to generic machine learning projects. Key features of the CRISP-DM that make it attractive as part of the base workflow model are
It is not proprietary.
It is application, industry, and tool neutral.
It explicitly views the data analytics process from both an application-focused and a technical perspective.
This method can be used for project planning and management, for communicating, and for documentation purposes.
The CRISP-DM reference model offers an overview of the life cycle of a machine learning project. This schema or mental model helps in approaching the machine learning project from a data perspective before taking an algorithmic point of view and provides the baseline for the definition of a clear workflow. Figure 1.1 shows the six phases of the process. It is worth noting that data is at the core of this process.
Looking at figure 1.1, we see that the sequence of the phases is fluid. The arrows indicate only the most important and frequent dependencies between phases; in a particular project, the outcome of each phase determines which phase, or which particular task of a phase, has to be performed next.
Figure 1.1 The six phases of the CRISP-DM process
The outer circle symbolizes the cyclic nature of the process, which is not finished when a solution is deployed. Subsequent machine learning processes can benefit not only from the experiences of previous ones (the virtuous cycle of Linoff and Berry [2011]), but also from the outcomes of the previous processes. Let’s outline each phase in more detail.
1.1.1 Business understanding
The first phase requires defining the goals of the machine learning project. These objectives are often expressed in general terms: increase revenue, improve the customer experience, get better and customized search results, sell more products, and so on. To convert these high-level problem definitions to concrete requirements and constraints for the machine learning project, it is necessary to understand the business and the domain.
Machine learning projects are software projects, and during this phase, it is also important to learn the language and domain concepts. This knowledge will not only help with communication between the data scientist and the internal team during subsequent phases, but also improve the quality of the documentation and the presentation of the results.
The outcomes of this phase are
A clear understanding of the domain and the business perspective
A definition of the goals, requirements, and constraints
The conversion of this knowledge to a machine learning problem definition
A preliminary and reasonable project plan designed to achieve the objectives
The goals of the first iteration shouldn’t be too broad, because this round requires a lot of effort related to the injection of the machine learning process into the existing infrastructure. At the same time, it is important to keep future extensions in mind while designing the first iteration.
1.1.2 Data understanding
The data-understanding phase starts by inquiring about the data sources and collecting some data from each of them, and proceeds with these activities:
Get familiar with the data.
Identify data quality problems.
Get first insights into the data.
Detect interesting subsets to form hypotheses about hidden information.
Data understanding requires domain and business understanding. Moreover, looking at the data helps build understanding of the domain and the business perspective, which is why there is a feedback loop between this phase and the previous one.
The outcomes of this phase are
A clear understanding of the data sources available
A clear understanding of the different kinds of data and their content (or at least of all the significant parts of the machine learning goals)
An architecture design to get or extract this data and feed it into the next steps of the machine learning workflow
1.1.3 Data preparation
This phase covers all the activities to gather data from multiple sources and organize it in the specific kind of structure required by the algorithm in the modeling phase. Data preparation tasks include record and attribute selection, feature engineering, data merging, data cleaning, construction of new attributes, and enrichment of existing data. As pointed out before, the quality of the data has an enormous impact on the final results of the next phases, so this phase is crucial.
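As a rough illustration of these tasks, the following sketch merges two toy sources, fills missing values, and constructs a new attribute using pandas. The column names and data are assumptions made only for the example.

import pandas as pd

# Two toy sources to be merged into a single training dataset
orders = pd.DataFrame({"user_id": [1, 2, 2], "amount": [30.0, None, 12.5]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["IT", "UK"]})

dataset = orders.merge(users, on="user_id", how="left")                   # merge sources
dataset["amount"] = dataset["amount"].fillna(dataset["amount"].median())  # clean missing values
dataset["is_high_value"] = dataset["amount"] > 20                         # construct a new attribute
print(dataset)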
The outcomes of this phase are
One or more data structure definitions, using adequate design techniques
A well-defined data pipeline for feeding the machine learning algorithm training data
A set of procedures for merging, cleaning, and enriching data
Another outcome of this phase is the identification of the database management system where this data will be stored while waiting to be processed.
For the sake of completeness, it is not always required to have an explicit data store for persisting data before further processing. It is possible to extract data and convert it before the processing phase. Having such an intermediate step, however, has a lot of advantages related to performance and data quality as well as to further extensibility.
1.1.4 Modeling
The modeling phase is where machine learning occurs. Different algorithms are selected and applied, and their parameters are calibrated to optimal values. The algorithms are used to build a range of prediction models from which the best is selected for deployment when the evaluation phase is complete. An interesting aspect is that some algorithms produce a prediction model, but others do not.³
The outcomes of this phase are
The set of algorithms to be tested in the next phase
The related predictive models (where applicable)
There is a close link between data preparation and modeling, because you often identify problems with the data and get ideas for constructing new data points while modeling. Moreover, some techniques require specific data formats.
1.1.5 Evaluation
At this stage in a machine learning project, you have built one or more predictive models that appear to be of high quality. Before a model can be deployed, it is important to evaluate it thoroughly and review the steps executed to construct the model, so that you can be certain it properly achieves the business objectives defined at the beginning of the process.
This evaluation is conducted in a formal way, such as splitting the available data into a training set (80%) and a testing set (20%). Another primary objective is to determine whether any important business issues have not been sufficiently considered.
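As a minimal illustration of the 80/20 split mentioned above, the sketch below uses scikit-learn’s train_test_split on toy arrays; the data itself is invented for the example.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix: 10 samples, 2 features
y = np.array([0, 1] * 5)          # toy labels

# Hold out 20% of the samples for evaluation, train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2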
The outcomes of this phase are
A set of values that allows measurement of performance. (The specific measure of success depends on the algorithm type and the scope.)
A thorough evaluation of whether the business goals are achieved.
Authorization for using the solution in a production environment.
1.1.6 Deployment
Because machine learning models are built to serve some purpose in an organization, the creation of a model is generally not the end of the project. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as releasing a complete infrastructure that delivers a service to end users. In many cases, the customer—not the data scientist—carries out the deployment steps. In any event, it is important to understand up front what actions will need to be carried out to make use of the created models.
The outcomes of this phase are as follows:
One or multiple reports with the results of the prediction models
The predictive models themselves, used for predicting the future and supporting decisions
An infrastructure to provide a specific set of services to the end users
When the project is in production, it is necessary to monitor it constantly (evaluating performance, for example).
1.2 Machine learning challenges
Machine learning projects have some intrinsic challenges that make them complex to accomplish. This section summarizes the main aspects you need to take into account when approaching a new machine learning project.
1.2.1 The source of truth
The CRISP-DM model puts data at the center of the machine learning process by describing the life cycle from a data perspective. The training data represents the source of truth from which any insight can be extracted and any prediction can be made. Managing the training data requires a lot of effort. To quote Jeffrey Heer, professor of computer science at the University of Washington, “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
As a case in point, it has been estimated that data scientists spend up to 80% of their time on data preparation [Lohr, 2014].
I often use the following statement to shift the focus to data before discussing the details of the algorithms:
Even the best learning algorithm on the wrong data produces the wrong results.
The seminal papers by Banko and Brill [2001] and Halevy, Norvig, and Pereira [2009] point out how, for complex problems, data often matters more than algorithms. Both articles consider natural language processing, but the concept can be generalized to machine learning in general [Sculley, 2015].
Figure 1.2, from Banko and Brill [2001], shows the learning curves for a set of learners, considering the average performance of each on different sizes of training data (up to 1 billion words). The specific algorithms used are not important here; the key takeaway is that, as is clear from the image, increasing the amount of data available during the training phase improved the performance of all the learners. These results suggest that it’s worth reconsidering spending time and money on corpus development versus spending it on algorithm development. From another perspective, as a data scientist you can focus on the vertical dimension—finding a better algorithm—but the graph shows that there is more room for improvement in the horizontal direction—gathering more data. As proof, the worst-performing algorithm in figure 1.2 performs much better with 10 million elements than the best one with 1 million elements.
Figure 1.2 Learning curves for confusion set disambiguation
Collecting data from multiple sources allows you not only to access a huge set of data, but also to improve the quality of the data, solving problems such as sparsity, misspellings, and incorrect values. Gathering data from a variety of sources is not an issue; we live in the big data era, and an abundance of digital data is available from the web, sensors, smartphones, corporate databases, and open data sources. But although there is value in combining different datasets, there are problems too. Data from different sources comes in different formats. Before the learner can analyze it, the data must be cleaned up, merged, and normalized into a unified and homogeneous schema that the algorithm can understand. Furthermore, obtaining additional training data has a nonzero cost for many problems, and for supervised learning, this cost may be high.
For these reasons, the data represents the first big challenge in the machine learning process. Data concerns can be summarized in four categories:
Insufficient quantity of data—Machine learning requires a lot of training data to work properly. Even for simple use cases, it needs thousands of examples, and for complex problems such as deep learning or for nonlinear algorithms, you may need millions of examples.
Poor quality of data—Data sources are always full of errors, outliers, and noise. Poor data quality directly affects the quality of the results of the machine learning process because it is hard for many algorithms to discard wrong (incorrect, unrelated, or irrelevant) values and to then detect underlying patterns in this mess.
Nonrepresentative data—Machine learning is a process of induction: the model makes inferences from what it has observed and will likely not support edge cases that your training data does not include. Furthermore, if the training data is too noisy or is related only to a subset of possible cases, the learner might generate bias or overfit the training data and will not be able to generalize to all the possible cases. This is true for both instance-based and model-based learning algorithms.
Irrelevant features—The algorithm will learn in the right way if the data contains a good set of relevant features and not too many irrelevant features. Although it is often a useful strategy to select more features, with the goal of increasing the accuracy of the model, more is not always better. Using more features will enable the learner to find a more detailed mapping from feature to target, which increases the risk that the model computed will overfit the data. Feature selection and feature extraction represent two important tasks during the preparation of data.
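As a small, hypothetical illustration of feature selection, the following sketch keeps only the k features most related to the target, using scikit-learn; the random data and the choice of k are assumptions made for the example.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # 20 candidate features
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # the target depends only on features 0 and 3

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_reduced = selector.transform(X)          # keep only the 2 most informative columns
print(selector.get_support(indices=True))  # most likely features 0 and 3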
To overcome these issues, data scientists have to gather and merge data from multiple sources, clean it, and enrich it by using external sources. (Moreover, it often happens that data is prepared for a certain purpose, but along the way, you discover something new, and the desired purpose changes.) These tasks are not simple ones; they require not only a significant amount of professional skill, but also a data management platform that allows the changes to be performed in a convenient way.
The issues related to the quality of the training examples determine a set of data management constraints and requirements for the infrastructure of the machine learning project. The problems can be summarized as follows:
Managing big data—Gathering data from multiple data sources and merging it into a unified source of truth will generate a huge dataset, and as pointed out before, increasing the amount of (quality) data will improve the quality of the learning process. Chapter 2 considers the characteristics of a big data platform and shows how graphs can play a prominent role in taming such a beast.
Designing a flexible schema—Try to create a schema model that provides the capabilities to merge multiple heterogeneous schemas into a unified and homogeneous data structure that satisfies the informational and navigational requirements. The schema should evolve easily according to changes in the purpose of the machine learning project. Chapter 4 introduces multiple data model schemas and best practices to model the data for several scenarios.
Developing efficient access patterns—Fast data reads boost the performance, in terms of processing time, of the training process. Tasks such as feature extraction, filtering, cleaning, merging, and other preprocessing of the training data will benefit from the use of a data platform that provides multiple and flexible access patterns.
1.2.2 Performance
Performance is a complex topic in machine learning because it can be related to multiple factors:
Predictive accuracy, which can be evaluated by using different performance measures. A typical performance measure for regression problems is the root mean squared error (RMSE), which measures the standard deviation⁴ of the errors the system makes in its predictions. In other words, it takes the differences between the estimated values and the known values for all the samples in the test dataset, squares them, averages them, and takes the square root (a minimal sketch follows this list). I present other techniques for measuring performance later in the book, when discussing the different algorithms. Accuracy depends on several factors, such as the amount of data available for training the model, the quality of the data, and the algorithm selected. As discussed in section 1.2.1, the data plays a primary role in guaranteeing a proper level of accuracy.
Training performance, which refers to the time required to compute the model. The amount of data to be processed and the type of algorithm used determine the processing time and the storage required for computing the prediction model. Clearly, this problem mainly affects the algorithms that produce a model as a result of the training phase. For instance-based learners,⁵ performance problems will appear later in the process, such as during prediction. In batch learning, the training time is generally longer, due to the amount of data to be processed (compared with the online learning approach, in which the algorithm learns incrementally from a smaller amount of data). Although in online learning the amount of data to be processed is small, the speed at which it is crunched affects the capacity of the system to stay aligned with the latest data available, which directly affects the accuracy of the predictions.
Prediction performance, which refers to the response time required to provide predictions. The output of the machine learning project could be a static one-time report to help managers make strategic decisions or an online service for end users. In the first case, the time required to complete the prediction phase and compute the model is not a significant concern, so long as the work is completed in a reasonable time frame (that is, not years). In the second case, prediction speed does matter, because it affects the user experience and the efficacy of the prediction. Suppose that you are developing a recommendation engine that recommends products similar to what the user is currently viewing according to their interests. The user navigation speed is quite high, which implies the need for a significant number of predictions in a short time interval; only a few milliseconds are available to suggest something useful before the user proceeds to the next item. In this scenario, the speed of prediction is the key to success.
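Here is the minimal RMSE sketch referenced in the first item of the list above; the sample values are invented for illustration.

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3.0, 5.0, 2.0], [2.5, 5.5, 2.0]))  # ~0.41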
These factors can be translated into multiple requirements for machine learning projects, such as fast access to the data source during training, high data quality, an efficient access pattern for the model to accelerate predictions, and so on. In this context, graphs could provide the proper storage mechanism for both source and model data, reducing the access time required to read data as well as offering multiple algorithmic techniques for improving the accuracy of the predictions.
1.2.3 Storing the model
In the model-based learner approach, the output of the training phase is a model that will be used for making predictions. This model requires time to be computed and has to be stored in a persistence layer to avoid recomputation at every system restart.
The model’s structure is related directly to the specific algorithm or the algorithm class employed. Examples include
Item-to-item similarities for a recommendation engine that uses a nearest-neighbor approach
An item-cluster mapping that expresses how the elements are grouped in clusters
The sizes of the two models differ enormously. Consider a system that contains 100 items. As a first effort, item-to-item similarities would require 100 x 100 entries to be stored. Taking advantage of optimizations, this number can be reduced to consider only the top k similar items, in which case the model will require 100 x k entries. By contrast, item-cluster mapping requires only 100 entries; hence, the space required to store the model in memory or on disk could be huge or relatively modest. Moreover, as pointed out earlier, model access/query time affects global performance during the prediction phase. For these reasons, model storage management represents a significant challenge in machine learning.
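The following sketch contrasts the two model shapes just described; the item count, the value of k, and the random similarities are assumptions used only to show the difference in storage size.

import numpy as np

n_items, k = 100, 10
rng = np.random.default_rng(0)

# Nearest-neighbor style model: the top-k most similar items for every item
item_to_item = {i: rng.permutation(n_items)[:k].tolist() for i in range(n_items)}

# Clustering style model: a single cluster label per item
item_to_cluster = {i: int(i % 5) for i in range(n_items)}

print(sum(len(v) for v in item_to_item.values()))  # 1,000 stored entries (100 x k)
print(len(item_to_cluster))                        # 100 stored entries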
1.2.4 Real time
Machine learning is being used more and more frequently to deliver real-time services to users. Examples run the full spectrum from simple recommendation engines that react to the user’s last clicks all the way to self-driving cars that have been instructed not to injure a pedestrian crossing the street. Although the outcomes in case of failure are vastly different in these two examples, in both cases the capability of the learner to react quickly (or at least in an appropriately timely manner) to new stimuli coming from the environment is fundamental to the quality of the final results.
Consider a recommendation engine that provides real-time recommendations for anonymous users. This anonymity (the user is not registered or logged in) means that there is no long-term history of previous interactions—only short-term, session-based information provided by the use of cookies. This task is a complex one that involves multiple aspects and affects several phases of a machine learning project. The approaches taken could differ according to the learner(s) used, but the goals can be described as follows:
Learn fast. The online learner should be able to update the model as soon as new data is available. This capability will reduce the time gap between events or generic feedback, such as navigational clicks or interaction with a search session and updating of the model. The more the model is aligned to the latest events, the more it is able to meet the current needs of the user.
Predict fast. When the model is updated, the prediction should be fast—a maximum of a few milliseconds—because the user may navigate away from the current page or even change their opinion quite rapidly.
Both of these goals require algorithms that can align the model quickly, as well as a storage mechanism (in memory, on disk, or a combined version) that provides fast memorization and efficient access patterns.
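As a hedged illustration of the learn-fast, predict-fast goals, the sketch below updates an incremental scikit-learn classifier batch by batch. The streaming batches are random, invented data, and the specific learner is an assumption rather than the book’s choice.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()          # an incremental (online) learner
classes = np.array([0, 1])
rng = np.random.default_rng(0)

for _ in range(5):               # pretend each batch arrives as a stream of fresh events
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)  # learn fast: update the model on arrival

print(model.predict(rng.normal(size=(1, 4))))  # predict fast: a single, immediate prediction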
1.3 Graphs
As stated in the introduction to this chapter, graphs provide models and algorithms that can heavily support machine learning projects. Even though a graph is a simple concept, it is important to understand how to represent it and how to use the main concepts around it. This section introduces the key elements of the graph world. If you are already familiar with these concepts, you can skip this section.
1.3.1 What is a graph?
A graph is a simple and quite old mathematical concept: a data structure consisting of a set of vertices (or nodes/points) and edges (or relationships/lines) that can be used to model relationships among a collection of objects. Legend says that the lazy Leonhard Euler first started talking about graphs in 1736. While visiting Königsberg in Prussia (now Kaliningrad, Russia), Euler didn’t want to spend too much time walking in the city, which sat on both sides of the Pregel River and included two large islands that were connected to each other and to the two mainland portions of the city by seven bridges. Euler formalized the problem as planning a walk through the city that would cross each of those bridges once and only once. He proved that doing so was impossible, which led to the invention of graphs and graph theory [Euler, 1736]. So he stayed home instead. Figure 1.3 shows an old map of Königsberg and a representation of the graph that Euler used to prove his thesis.
Figure 1.3 The bridges in Königsberg that led to the invention of graph theory
More formally, a graph is a pair G = (V, E), where V is a collection of vertices, V = {Vi, i = 1, ..., n}, and E is a collection of edges over V, E = {Eij = (Vi, Vj), Vi ∊ V, Vj ∊ V}. E ⊆ [V]²; thus, the elements of E are two-element subsets of V [Diestel, 2017].
The simplest way to represent a graph is to draw a dot or a small circle for each vertex and then join two of those vertices by a line if they form an edge. Such a representation is shown in figure 1.4.
Figure 1.4 The undirected graph on V = {1, 2, 3, 4, 5} with edge set E = {(1,2), (1,5), (2,5), (2,4), (4,3)}
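As a quick illustration of this definition (using the Python networkx library for convenience, rather than the Neo4j tooling used later in the book), the following sketch builds the graph of figure 1.4 and queries adjacency.

import networkx as nx

# The undirected graph of figure 1.4
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 5), (2, 5), (2, 4), (4, 3)])

print(sorted(G.neighbors(2)))  # [1, 4, 5]: the vertices adjacent to 2
print(G.has_edge(1, 4))        # False: 1 and 4 are not adjacent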
Graphs can be directed or undirected, depending on whether a direction of traversal is defined on the edges. In directed graphs, an edge Eij can be traversed from Vi to Vj but not in the opposite direction; Vi is called the tail, or start node, and Vj is called the head, or end node. In undirected graphs, edge traversals in both directions are valid. Figure 1.4 represents an undirected graph, and figure 1.5 represents a directed graph.
Figure 1.5 The directed graph on V = {1, ..., 5} with edge set E = {(1,2), (2,5), (5,1), (2,4), (3,4)}
The arrow indicates the direction of the relationship. By default, edges in a graph are unweighted; thus, the corresponding graphs are said to be unweighted. When a weight—a numerical value used to convey some significance—is assigned to the edges, the graph is said to be weighted. Figure 1.6 shows the same graphs as figures 1.4 and 1.5 with a weight assigned to each edge.
Figure 1.6 An undirected weighted graph (a) and a directed weighted graph (b)
Two vertices x and y of G are defined as adjacent, or neighbors, if {x,y} is an edge of G. The edge Eij connecting them is said to be incident on the two vertices Vi and Vj. Two distinct edges e and f are adjacent if they have a vertex in common. If all the vertices of G are pairwise adjacent, G is complete. Figure 1.7 shows a complete graph in which each vertex is connected to all the other vertices.
Figure 1.7 A complete graph in which each vertex is connected to all the others
One of the most important properties of a vertex in a graph is its degree, defined as the total number of edges incident to that vertex, which is also equal to the number of neighbors of that vertex. In the undirected graph of figure 1.4, for example, the vertex 2 has degree 3 (it has the vertices 1, 4, and 5 as neighbors); the vertices 1 (neighbors are 2, 5), 4 (neighbors are 2, 3), and 5 (neighbors are 1, 2) have degree 2, whereas vertex 3 has degree 1 (is connected only with 4).
In a directed graph, the degree of a vertex Vi is split into the in-degree of the vertex, defined as the number of edges for which Vi is their end node (the head of the arrow) and the out-degree of the vertex, which is the number of edges for which Vi is their start node (the tail of the arrow). In the directed graph of figure 1.5, vertices 1 and 5 have an in-degree and out-degree of 1 (they each have two relationships, one ingoing and one outgoing), vertex 2 has an in-degree of 1 and an out-degree of 2 (one ingoing relationship from 1 and two outgoing to 4 and 5), vertex 4 has an in-degree of 2 and an out-degree of 0 (two ingoing relationships from 2 and 3), and vertex 3 has an out-degree of 1 and in-degree of 0 (one outgoing relationship to 4).
The average degree of a graph is computed as the sum of the degrees of all its vertices divided by the number of vertices:

average degree = (1/N) Σi deg(Vi) = 2E/N

where N is the number of vertices in the graph and E is the number of edges (each edge contributes to the degree of two vertices).
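The same quantities can be computed programmatically. The sketch below (again using networkx as an illustrative assumption) reproduces the in-degrees and out-degrees of figure 1.5 and the average degree of figure 1.4.

import networkx as nx

# The directed graph of figure 1.5
D = nx.DiGraph([(1, 2), (2, 5), (5, 1), (2, 4), (3, 4)])
print(D.in_degree(4), D.out_degree(4))  # 2 0
print(D.in_degree(2), D.out_degree(2))  # 1 2

# Average degree of the undirected graph of figure 1.4
G = nx.Graph([(1, 2), (1, 5), (2, 5), (2, 4), (4, 3)])
print(sum(dict(G.degree()).values()) / G.number_of_nodes())  # 2.0 = 2 * E / N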
A sequence of vertices with the property that each consecutive pair in the sequence is connected by an edge is called a path. A path with no repeating vertices is called a simple path. A cycle is a path in which the first and the last vertex coincide. In figure 1.4, [1,2,4], [1,2,4,3], [1,5,2,4,3], and so on are paths; in particular, the path of vertices [1,2,5] represents a cycle.
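The following sketch, under the same networkx assumption, enumerates the simple paths between vertices 1 and 3 of figure 1.4 and finds the cycle through vertices 1, 2, and 5.

import networkx as nx

G = nx.Graph([(1, 2), (1, 5), (2, 5), (2, 4), (4, 3)])  # the graph of figure 1.4

print(list(nx.all_simple_paths(G, source=1, target=3)))
# [[1, 2, 4, 3], [1, 5, 2, 4, 3]]: the simple paths from vertex 1 to vertex 3

print(nx.cycle_basis(G))
# one cycle, through vertices 1, 2, and 5 (vertex ordering may vary)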
1.3.2 Graphs as models of networks
Graphs are useful for representing how things are either physically or logically linked in simple or complex structures. A graph in which we assign names and meanings to the edges and vertices becomes what is known as a network. In these cases, a graph is the mathematical model for describing a network, whereas a network is a set of relations between objects, which could include people, organizations, nations, items found in a Google search, brain cells, or electrical transformers. This diversity illustrates the great power of graphs: a simple structure (which also means that they require little disk storage) that can be used to model⁶ complex systems.
Let’s explore this concept by using an example. Suppose that we have the graph shown in figure 1.8.
Figure 1.8 A nontrivial generic graph
This graph, pure in its mathematical definition, can be used to model several types of networks, depending on the semantics assigned to its edges and vertices:
A social network, if the vertices are people and each edge represents any sort of relationship between humans (friendship, family member, coworker)
An informational network, if the vertices are information structures such as web pages, documents, or papers and the edges represent logical connections such as hyperlinks, citations, or cross-references
A communication network, if the vertices are computers or other devices that can relay messages and the edges represent direct links along which messages can be transmitted
A transportation network, if the vertices are cities and the edges represent direct connections using flights or trains or roads
This small set of examples demonstrates how the same graph can represent multiple networks by assigning different semantics to edges and vertices. Figure 1.9 illustrates different types of networks.
Looking at figure 1.9, we can spot another interesting characteristic of graphs: they are highly communicative. Graphs are able to display information in a clear manner, which is why they are often used as information maps. By representing data as networks and using graph algorithms, it is possible to
Find complex patterns
Make them visible for further investigation and interpretation
Figure 1.9 Clockwise from top left: a co-occurrence network,⁷ ARPA network 1974,⁸ London Tube network,⁹ and electrical grid¹⁰
When the power of machine learning is combined with the power of the human brain, it enables efficient, advanced, and sophisticated data processing and pattern recognition. Networks are useful for displaying data by highlighting connections among elements. Newspapers and news websites are increasingly using networks, not only to help people navigate the data, but also as a powerful investigation tool. Recently (at the time of this writing), the Panama Papers¹¹ showcased the astonishing features of networks. The International Consortium of Investigative Journalists (ICIJ) analyzed the leaked financial documents to expose highly connected networks of offshore tax structures used by the world’s richest elites. The journalists extracted the entities (people, organizations, and any sort of intermediaries) and relationships (protector, beneficiary, shareholder, director, and so on) from the documents, stored them in a network, and analyzed them by using visual tools. The results looked like figure 1.10. Here, networks, graph algorithms, and graph visualization made evident something that otherwise would have been impossible to discover by using traditional data mining tools.
Figure 1.10 An example of the graph visualization for the Panama Papers
A lot of interesting examples in this direction are also available in blog posts by Valdis Krebs,¹² an organizational consultant who specializes in social network applications. His work contains examples of combining graph-powered machine learning with the human mind by way of graph visualization. Here, we consider one of his most famous examples.
The data in figure 1.11 was gathered from Amazon.com and represents its list of the top political books purchased in the United States in 2008 [Krebs, 2012]. Krebs applied network analysis principles to the data to create a map of books related to that year’s presidential election. Two books are linked if they were often purchased by the same customer. These books are known as also-bought pairs (a customer who bought this book also bought that book).
Figure 1.11 Network map of US political books in 2008 [Krebs, 2012]
There are three different and nonoverlapping clusters:
An Obama cluster of books in the top-left corner
A Democratic (blue) cluster in the middle
A Republican (red) cluster in the bottom-right corner
In 2008, the US political climate was highly polarized. This fact is mirrored in Amazon’s political-book data, with figure 1.11 showing a deep divide between conservative and liberal voters. There were no connections or intermediaries between red and blue books; each cluster was completely distinct from the others. As mentioned, there was a separate cluster of people reading biographies of presidential hopeful Barack Obama, but they were apparently not interested in reading or purchasing other political books.
Four years later, in 2012, the same analysis produced a network that appeared to be substantially different (figure 1.12). This network shows a lot of books that act as bridges between clusters. Moreover, potential voters appear to be reading books about both major candidates. The result is a more complex network that has no isolated clusters.
Figure 1.12 Network map of US political books in 2012 [Krebs, 2012]
The example of a network of political books introduces an important aspect of networks. If a graph is a pure mathematical concept that lives in its own Platonic world,¹³ networks, as abstractions of some concrete system or ecosystem, are subject to forces that act on them and change their structure. We refer to these forces as surrounding contexts: factors that exist outside the vertices and edges of a network but nonetheless affect how the network’s structure evolves over time. The nature of such contexts and the types of forces are specific to the kind of network. In social networks, for example, each individual has a distinctive set of personal characteristics, and similarities and compatibilities between two people’s characteristics influence link creation or deletion [Easley and Kleinberg, 2010].
One of the most basic notions governing the structure of social networks is homophily (from the Greek, meaning “love of the same”): links in a social network tend to connect people who are similar. More formally, if two people have characteristics that match in a proportion greater than expected in the population from which they are drawn or the network of which they are part, they are more likely to be connected [Verbrugge, 1977]. The converse is also true: if two people are connected, they are more likely to have common characteristics or attributes. For this reason, our friends on Facebook (for example) don’t look like a random sample of people, but are generally similar to us along ethnic, racial, and geographic dimensions; they tend to be similar to us in age, occupation, interests, beliefs, and opinions. This observation has a long history, with origins long before Mark Zuckerberg wrote his first line of code. The underlying idea can be found in the writings of Plato (“Similarity begets friendship”) and Aristotle (people love those who are like themselves), as well as in folk propositions such as “Birds of a feather flock together.” The homophily principle also applies to groups, organizations, countries, or any aspect of social units.
Understanding the surrounding contexts and the related forces that act on a network helps with machine learning tasks in multiple ways:
Networks are conduits for both wanted and unwanted flows. Marketers are always trying to reach and persuade people, and personal contact is most effective if one can find a way to start a snowball rolling. This concept is the basis of so-called viral marketing.
Understanding such forces allows the prediction of how the network will evolve over time, and enables data scientists to proactively react to such changes or use them for specific business purposes.
Findings in sociological and psychological disciplines point to the relevance of a person’s social network in determining their tastes, preferences, and activities. This information is useful for building recommendation engines. One of the problems related to recommendation engines is the cold-start problem: you can’t predict anything for a new user because you have no history for them. Social networks and the homophily principle can be used to make a recommendation based on the tastes of connected users.
1.4 The role of graphs in machine learning
Graphs are used to characterize interactions between objects of interest, to model simple and complex networks, or in general to represent real-world problems. Because they are based on a rigorous but straightforward formalism, they are used in many scientific fields, from computer science to historical sciences. We shouldn’t be surprised to see them being widely used in machine learning as a powerful tool that can enable intuition and power a lot of useful features. Graph-based machine learning is becoming more