Building A Decision Tree Classifier From Scratch
Building A Decision Tree Classifier From Scratch
Building A Decision Tree Classifier From Scratch
1 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
Table of Contents
1. Introduction
2. Algorithm's Logic
3. Fundamental Concepts
4. The Structure of a Decision Tree Classifier
5. Implementation in Python
a. Calculate Gini Impurity
b. Find the Best Split
c. Create Decision Nodes and Leaf Nodes
d. Build the Tree Recursively
e. Prune the Tree
f. Making Predictions
6. Conclusion
2 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
In this guide, we will explore the fundamental concepts and steps involved in building a Decision Tree
Classifier from scratch in Python. We will not rely on high-level libraries like Scikit-Learn. Instead, we will
implement the algorithm step-by-step to understand its inner workings.
2. Algorithm's Logic
The decision tree algorithm can be summarized in the following steps:
1. Select the Best Feature: Choose the feature that best splits the data using a metric like Gini impurity or
entropy.
2. Split the Dataset: Divide the dataset into subsets based on the selected feature.
3. Create Decision Nodes: Create nodes representing the decision rules for splitting.
4. Repeat Recursively: Repeat the process for each subset until a stopping criterion is met (e.g., all
instances belong to one class or maximum tree depth is reached).
• Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it was
randomly labelled according to the distribution of labels in the dataset.
• Entropy: Measures the uncertainty or randomness in the data.
3 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
3. Fundamental Concepts
Before diving into the implementation, let's briefly discuss some key concepts:
I. Node
A single point of decision in the tree, which consists of a condition on a feature.
II. Root Node
The topmost node in the tree, where the first split is made.
III. Leaf Node
The terminal nodes that provide the final classification.
IV. Entropy
A measure of impurity or randomness in the dataset. It is used to determine the quality of a split. The
formula for entropy is:
4 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
5 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
5. Implementation in Python
Let's implement a simple Decision Tree in Python to classify data. We'll use NumPy, pandas and collections for
numerical computations.
import numpy as np
import pandas as pd
from collections import Counter
Entropy Calculation
The entropy function calculates the entropy of a given set of class labels. It counts the occurrences of each class,
computes the probabilities, and sums up the contributions of each class to the entropy using the formula.
def entropy(y):
# Calculate the entropy of label array y
class_counts = Counter(y)
total_instances = len(y)
entropy_value = 0.0
return entropy_value
The information_gain function computes the reduction in entropy achieved by splitting the dataset based on a given
feature. It calculates the total entropy and subtracts the weighted entropy of the subsets created by the split.
6 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
class DecisionTreeNode:
def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
# Initialize the node
self.feature_index = feature_index
self.threshold = threshold
self.left = left
self.right = right
self.value = value
7 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
Step 4: Define the Decision Tree Classifier Class
The DecisionTreeClassifier class handles the tree's growth and predictions. The fit method grows the tree
recursively using the _grow_tree method, which finds the best split and creates child nodes. The predict method
traverses the tree to predict the class label for each instance.
class DecisionTreeClassifier:
def __init__(self, max_depth=None):
# Initialize the classifier
self.max_depth = max_depth
self.root = None
# Stopping criteria
if depth >= self.max_depth or num_classes == 1 or num_samples <= 1:
leaf_value = self._most_common_class(y)
return DecisionTreeNode(value=leaf_value)
8 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
# Sample dataset
X = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1]
])
y = np.array([0, 1, 1, 0])
# Make predictions
predictions = tree.predict(X)
print(f"Predictions: {predictions}")
6. Conclusion
In this tutorial, we built a decision tree classifier from scratch in Python, covering the algorithm's logic, tree
construction, and pruning. We implemented the decision tree algorithm step-by-step and provided sample code
with detailed explanations. This implementation serves as a foundational understanding of how decision trees
work and can be a stepping stone to using more advanced machine learning libraries.
9 ANSHUMAN JHA
Building a Decision Tree Classifier from Scratch in Python
10 ANSHUMAN JHA