
Building a Decision Tree Classifier from Scratch in Python

Anshuman Jha

Table of Contents
1. Introduction
2. Algorithm's Logic
3. Fundamental Concepts
4. The Structure of a Decision Tree Classifier
5. Implementation in Python
   a. Importing Required Libraries
   b. Entropy and Information Gain
   c. The Decision Tree Node Class
   d. The Decision Tree Classifier Class
   e. Testing the Implementation
6. Conclusion


1. Introduction


A Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks. It recursively splits the data into subsets based on the values of the input features. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

In this guide, we will explore the fundamental concepts and steps involved in building a Decision Tree
Classifier from scratch in Python. We will not rely on high-level libraries like Scikit-Learn. Instead, we will
implement the algorithm step-by-step to understand its inner workings.

2. Algorithm's Logic
The decision tree algorithm can be summarized in the following steps (a schematic sketch follows the list):

1. Select the Best Feature: Choose the feature that best splits the data using a metric like Gini impurity or
entropy.
2. Split the Dataset: Divide the dataset into subsets based on the selected feature.
3. Create Decision Nodes: Create nodes representing the decision rules for splitting.
4. Repeat Recursively: Repeat the process for each subset until a stopping criterion is met (e.g., all
instances belong to one class or maximum tree depth is reached).
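As an illustrative sketch only, the recursion might look like this; is_pure, best_feature, split_on, majority_class, Leaf, and Node are placeholder names, and the concrete implementation appears in Section 5:

def build_tree(X, y, depth=0):
    # Stop when the node is pure or a depth limit is reached
    if is_pure(y) or depth >= MAX_DEPTH:
        return Leaf(majority_class(y))           # leaf node: final class
    feature = best_feature(X, y)                 # 1. select the best feature
    subsets = split_on(X, y, feature)            # 2. split the dataset
    children = [build_tree(X_s, y_s, depth + 1)  # 4. repeat recursively
                for X_s, y_s in subsets]
    return Node(feature, children)               # 3. create a decision node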

Gini Impurity and Entropy

These are measures used to evaluate the quality of a split; a small numerical illustration follows the definitions:

• Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it was
randomly labelled according to the distribution of labels in the dataset.
• Entropy: Measures the uncertainty or randomness in the data.
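Entropy is implemented in Section 5. Gini impurity does not appear in the code below, so here is a minimal sketch of it (assuming labels arrive as a plain list or 1-D array; gini_impurity is a name introduced here for illustration):

from collections import Counter

def gini_impurity(y):
    # Gini(S) = 1 - sum_i p_i^2, where p_i is the proportion of class i
    counts = Counter(y)
    n = len(y)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity([0, 0, 1, 1]))  # 0.5: maximally mixed for two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0: a pure node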


3. Fundamental Concepts
Before diving into the implementation, let's briefly discuss some key concepts:
I. Node
A single point of decision in the tree, which consists of a condition on a feature.
II. Root Node
The topmost node in the tree, where the first split is made.
III. Leaf Node
The terminal nodes that provide the final classification.
IV. Entropy
A measure of impurity or randomness in the dataset. It is used to determine the quality of a split. The formula for entropy is:

    Entropy(S) = - Σ_i p_i · log2(p_i)

where p_i is the proportion of class i in the dataset S. For example, an evenly mixed two-class set has an entropy of 1 bit, while a pure set has an entropy of 0.


V. Information Gain
The reduction in entropy achieved by partitioning the data based on a feature A. It helps in selecting the best feature for splitting:

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which feature A has value v. For example, a feature that splits S = {0, 0, 1, 1} perfectly into {0, 0} and {1, 1} yields a gain of 1 - 0 = 1 bit.


4. The Structure of a Decision Tree Classifier


[Figure: step-by-step structure of the decision tree classifier, showing the flow of data and logic and highlighting key functions, classes, and decision points.]

5. Implementation in Python
Let's implement a simple Decision Tree in Python to classify data. We'll use NumPy for numerical computations and collections.Counter for counting class labels.

Step 1: Importing Required Libraries


We will start by importing necessary libraries like NumPy for numerical operations.

import numpy as np
from collections import Counter

Step 2: Define Functions for Entropy and Information Gain

Entropy Calculation

The entropy function calculates the entropy of a given set of class labels. It counts the occurrences of each class,
computes the probabilities, and sums up the contributions of each class to the entropy using the formula.

def entropy(y):
    # Calculate the entropy of label array y
    class_counts = Counter(y)
    total_instances = len(y)
    entropy_value = 0.0

    for count in class_counts.values():
        probability = count / total_instances
        entropy_value -= probability * np.log2(probability)

    return entropy_value
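As a quick sanity check (illustrative values):

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: maximally mixed
print(entropy(np.array([0, 0, 0, 0])))  # 0.0: a pure set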

Information Gain Calculation

The information_gain function computes the reduction in entropy achieved by splitting the dataset based on a given
feature. It calculates the total entropy and subtracts the weighted entropy of the subsets created by the split.

def information_gain(X, y, feature_index):
    # Calculate the information gain of a feature
    total_entropy = entropy(y)

    # Get the unique values of the feature
    unique_values = np.unique(X[:, feature_index])
    weighted_entropy = 0.0

    for value in unique_values:
        subset_indices = X[:, feature_index] == value
        subset_y = y[subset_indices]
        weighted_entropy += (len(subset_y) / len(y)) * entropy(subset_y)

    return total_entropy - weighted_entropy
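For example, on a tiny dataset where the first feature determines the label exactly (X_demo and y_demo are illustrative values):

X_demo = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y_demo = np.array([0, 0, 1, 1])

print(information_gain(X_demo, y_demo, 0))  # 1.0: feature 0 separates the classes perfectly
print(information_gain(X_demo, y_demo, 1))  # 0.0: feature 1 carries no information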


Step 3: Define the Decision Tree Node Class


The DecisionTreeNode class represents a node in the decision tree. It can be an internal node with a feature index
and threshold or a leaf node with a class value.

class DecisionTreeNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        # Initialize the node; internal nodes carry a split rule, leaf nodes carry a class value
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
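For instance, a leaf that always predicts class 1, and an internal node that routes instances by comparing feature 0 against the value 0, could be built by hand like this (illustrative only; the classifier below constructs nodes automatically):

leaf = DecisionTreeNode(value=1)
node = DecisionTreeNode(feature_index=0, threshold=0,
                        left=leaf, right=DecisionTreeNode(value=0))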

Step 4: Define the Decision Tree Classifier Class
The DecisionTreeClassifier class handles the tree's growth and predictions. The fit method grows the tree
recursively using the _grow_tree method, which finds the best split and creates child nodes. The predict method
traverses the tree to predict the class label for each instance.

class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        # Initialize the classifier
        self.max_depth = max_depth
        self.root = None

    def fit(self, X, y):
        # Fit the classifier to the data
        self.root = self._grow_tree(X, y, depth=0)

    def _grow_tree(self, X, y, depth):
        # Grow the tree recursively
        num_samples, num_features = X.shape
        num_classes = len(set(y))

        # Stopping criteria: depth limit reached (if one is set),
        # pure node, or too few samples to split
        if ((self.max_depth is not None and depth >= self.max_depth)
                or num_classes == 1 or num_samples <= 1):
            leaf_value = self._most_common_class(y)
            return DecisionTreeNode(value=leaf_value)

        # Find the best split
        best_gain = -1
        split_index = None
        split_threshold = None

        for feature_index in range(num_features):
            gain = information_gain(X, y, feature_index)
            if gain > best_gain:
                best_gain = gain
                split_index = feature_index
                split_threshold = np.unique(X[:, feature_index])[0]  # Simplified threshold

        # Split on equality with the chosen value
        left_indices = X[:, split_index] == split_threshold
        right_indices = ~left_indices

        # If the split would leave one side empty, stop and create a leaf
        if not np.any(left_indices) or not np.any(right_indices):
            return DecisionTreeNode(value=self._most_common_class(y))

        # Grow the children recursively
        left_child = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_child = self._grow_tree(X[right_indices], y[right_indices], depth + 1)

        return DecisionTreeNode(feature_index=split_index, threshold=split_threshold,
                                left=left_child, right=right_child)

    def _most_common_class(self, y):
        # Find the most common class in the label array
        counter = Counter(y)
        most_common = counter.most_common(1)[0][0]
        return most_common

    def predict(self, X):
        # Predict the class labels for the input data
        return [self._predict(inputs) for inputs in X]

    def _predict(self, inputs):
        # Predict the class label for a single instance by walking the tree
        node = self.root
        while node.value is None:
            if inputs[node.feature_index] == node.threshold:
                node = node.left
            else:
                node = node.right
        return node.value


Step 5: Testing the Implementation


Let's test our implementation on a small XOR-style dataset. The classifier is trained and predictions are made to verify the correctness of the model.

# Sample dataset
X = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1]
])

y = np.array([0, 1, 1, 0])

# Initialize and fit the classifier


tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Make predictions
predictions = tree.predict(X)
print(f"Predictions: {predictions}")  # Should recover the training labels: [0, 1, 1, 0]

6. Conclusion
In this tutorial, we built a decision tree classifier from scratch in Python, covering the algorithm's logic and recursive tree construction. We implemented the decision tree algorithm step-by-step and provided sample code with detailed explanations. This implementation serves as a foundational understanding of how decision trees work and can be a stepping stone to using more advanced machine learning libraries.


Constructive comments and feedback are welcome.
